Abstract

We consider a generalization of stochastic bandits where the set of arms, $\mathcal{X}$, is allowed to be
a generic measurable space and the mean-payoff function is "locally Lipschitz" with respect to a
dissimilarity function that is known to the decision maker. Under this condition we construct an
arm selection policy, called HOO (hierarchical optimistic optimization), with improved regret
bounds compared to previous results for a large class of problems. In particular, our results imply
that if $\mathcal{X}$ is the unit hypercube in a Euclidean space and the mean-payoff function has a finite
number of global maxima around which the behavior of the function is locally continuous with
a known smoothness degree, then the expected regret of HOO is bounded up to a logarithmic
factor by $\sqrt{n}$, i.e., the rate of growth of the regret is independent of the dimension of the space.
We also prove the minimax optimality of our algorithm when the dissimilarity is a metric. Our
basic strategy has quadratic computational complexity as a function of the number of time steps
and does not rely on the doubling trick. We also introduce a modified strategy, which relies on
the doubling trick but runs in linearithmic time. Both results are improvements with respect to
previous approaches.

1 Introduction 

In the classical stochastic bandit problem a gambler tries to maximize his revenue by sequentially
playing one of a finite number of slot machines that are associated with initially unknown (and
potentially different) payoff distributions [26]. Assuming old-fashioned slot machines, the gambler
pulls the arms of the machines one by one in a sequential manner, simultaneously learning about
the machines' payoff distributions and gaining actual monetary reward. Thus, in order to maximize
his gain, the gambler must choose the next arm by taking into consideration both the urgency of
gaining reward ("exploitation") and acquiring new information ("exploration").

*This research was carried out within the INRIA project CLASSIC hosted by Ecole normale superieure and CNRS.

Maximizing the total cumulative payoff is equivalent to minimizing the (total) regret, i.e., min- 
imizing the difference between the total cumulative payoff of the gambler and the one of another 
clairvoyant gambler who chooses the arm with the best mean-payoff in every round. The quality of 
the gambler's strategy can be characterized as the rate of growth of his expected regret with time. 
In particular, if this rate of growth is sublinear, the gambler in the long run plays as well as the 
clairvoyant gambler. In this case the gambler's strategy is called Hannan consistent. 

Bandit problems have been studied in the Bayesian framework [19], as well as in the frequentist
parametric [2^;[H and non-parametric settings 01, and even in non-stochastic scenarios While 
in the Bayesian case the question is whether the optimal actions can be computed efficiently, in the
frequentist case the question is how to achieve low rate of growth of the regret in the lack of prior 
information, i.e., it is a statistical question. In this paper we consider the stochastic, frequentist, 
non-parametric setting. 

Although the first papers studied bandits with a finite number of arms, researchers have soon real- 
ized that bandits with infinitely many arms are also interesting, as well as practically significant. One 
particularly important case is when the arms are identified by a finite number of continuous-valued 
parameters, resulting in online optimization problems over continuous finite-dimensional spaces. 
Such problems are ubiquitous to operations research and control. Examples are "pricing a new 
product with uncertain demand in order to maximize revenue, controlling the transmission power 
of a wireless communication system in a noisy channel to maximize the number of bits transmit- 
ted per unit of power, and calibrating the temperature or levels of other inputs to a reaction so 
as to maximize the yield of a chemical process" [l^ . Other examples are optimizing parameters 
of schedules, rotational systems, traffic networks or online parameter tuning of numerical methods. 
During the last decades numerous authors have investigated such "continuum-armed" bandit prob- 
lems jj; [m @; [22; 12|. A special case of interest, which forms a bridge between the case of a finite 
number of arms and the continuum-armed setting, is formed by bandit linear optimization, see [l[ 
and the references therein. 

In many of the above-mentioned problems, however, the natural domain of some of the optimiza- 
tion parameters is a discrete set, while other parameters are still continuous-valued. For example, 
in the pricing problem different product lines could also be tested while tuning the price, or in the 
case of transmission power control different protocols could be tested while optimizing the power. 
In other problems, such as in online sequential search, the parameter-vector to be optimized is an 
infinite sequence over a finite alphabet jl^; [3| • 

The motivation for this paper is to handle all these various cases in a unified framework. More 
precisely, we consider a general setting that allows us to study bandits with almost no restriction on 
the set of arms. In particular, we allow the set of arms to be an arbitrary measurable space. Since 
we allow non-denumerable sets, we shall assume that the gambler has some knowledge about the 
behavior of the mean-payoff function (in terms of its local regularity around its maxima, roughly 
speaking). This is because when the set of arms is uncountably infinite and absolutely no assumptions
are made on the payoff function, it is impossible to construct a strategy that simultaneously achieves
sublinear regret for all bandit problems (see, e.g., d Corollary 4]). When the set of arms is a metric
space (possibly with the power of the continuum) previous works have assumed either the global 
smoothness of the payoff function 0; [21I; [l2j or local smoothness in the vicinity of the maxima . 
Here, smoothness means that the payoff function is either Lipschitz or Hölder continuous (locally
or globally). These smoothness assumptions are indeed reasonable in many practical problems of 
interest. 
In this paper, we assume that there exists a dissimilarity function that constrains the behavior
of the mean-payoff function, where a dissimilarity function is a measure of the discrepancy between
two arms that is required neither to be symmetric, nor to be reflexive, nor to satisfy the triangle
inequality. (The same notion was introduced simultaneously and independently of us by [23, Section 4.4]
under the name "quasi-distance.") In particular, the dissimilarity function is assumed to locally set a
bound on the decrease of the mean-payoff function at each of its global maxima. We also assume that the
decision maker can construct a recursive covering of the space of arms in such a way that the diameters of
the sets in the covering shrink at a known geometric rate when measured with this dissimilarity.



Relation to the literature. Our work generalizes and improves previous works on continuum- 
armed bandits. 

In particular, Kleinberg [21] and Auer et al. focused on one-dimensional problems, while we
allow general spaces. In this sense, the closest work to the present contribution is that of Kleinberg
et al. [22], who considered generic metric spaces assuming that the mean-payoff function is Lipschitz
with respect to the (known) metric of the space; its full version [23] relaxed this condition and only
requires that the mean-payoff function is Lipschitz at some maximum with respect to some (known)
dissimilarity^ Kleinberg et al. [l^ proposed a novel algorithm that achieves essentially the best 
possible regret bound in a minimax sense with respect to the environments studied, as well as a 
much better regret bound if the mean-payoff function has a small "zooming dimension" . 
Our contribution furthers these works in two ways: 

(i) our algorithms, motivated by the recent successful tree-based optimization algorithms IS IS; 
[isj . are easy to implement; 

(ii) we show that a version of our main algorithm is able to exploit the local properties of the 
mean-payoff function at its maxima only, which, as far as we know, was not investigated in 
the approach of Kleinberg et al. 0, [2^ . 

The precise discussion of the improvements (and drawbacks) with respect to the papers by
Kleinberg et al. [23] requires the introduction of somewhat extensive notation and is therefore
deferred to Section 5. However, in a nutshell, the following can be said.

First, by resorting to a hierarchical approach, we are able to avoid the use of the doubling trick, as
well as the need for the (covering) oracle, both of which the so-called zooming algorithm of Kleinberg
et al. [22] relies on. This comes at the cost of slightly more restrictive assumptions on the
mean-payoff function, as well as a more involved analysis. Moreover, the oracle is replaced by an a priori
choice of a covering tree. In standard metric spaces, such as the Euclidean spaces, such trees are
trivial to construct, though in full generality they may be difficult to obtain when their construction
must start from (say) a distance function only. We also propose a variant of our algorithm that has
smaller computational complexity, of order $n \ln n$, compared to the quadratic complexity of our
basic algorithm. However, the cheaper algorithm requires the doubling trick to achieve an anytime
guarantee (just like the zooming algorithm).

Second, we are also able to weaken our assumptions and to consider only properties of the
mean-payoff function in the neighborhoods of its maxima; this leads to regret bounds scaling as
$\widetilde{O}(\sqrt{n})$^ when, e.g., the space is the unit hypercube and the mean-payoff function has a
finite number of global maxima $x^*$ around which it is locally equivalent to a function $\|x - x^*\|^{\alpha}$
with some known degree $\alpha > 0$. Thus, in this case, we get the desirable property that the rate of
growth of the regret is independent of the dimensionality of the input space. (Comparable
dimensionality-free rates are obtained under different assumptions in [23].)



^ The present paper is a concurrent and independent work with respect to the paper of Kleinberg, Slivkins,
and Upfal [H. An extended abstract [H of the latter was published in May 2008 at STOC'08, while the NIPS'08 
version of the present paper was submitted at the beginning of June 2008. At that time, we were not aware of the 
existence of the full version [23], which was released in September 2008.

^We write $u_n = \widetilde{O}(v_n)$ when $u_n = O(v_n)$ up to a logarithmic factor.
Finally, in addition to the strong theoretical guarantees, we expect our algorithm to work well
in practice since the algorithm is very close to the recent, empirically very successful tree-search 
methods from the games and planning literature [l6; 17; 27; 11; isf . 



Outline. The outline of the paper is as follows: 

1. In Section 2 we formalize the $\mathcal{X}$-armed bandit problem.

2. In Section 3 we describe the basic strategy proposed, called HOO (hierarchical optimistic
optimization).

3. We present the main results in Section 4. We start by specifying and explaining our
assumptions (Section 4.1) under which various regret bounds are proved. Then we prove a
distribution-dependent bound for the basic version of HOO (Section 4.2). A problem with
the basic algorithm is that its computational cost increases quadratically with the number of
time steps. Assuming the knowledge of the horizon, we thus propose a computationally more
efficient variant of the basic algorithm, called truncated HOO, and prove that it enjoys a regret
bound identical to the one of the basic version (Section 4.3) while its computational
complexity is only log-linear in the number of time steps. The first set of assumptions constrains
the mean-payoff function everywhere. A second set of assumptions is therefore presented that
puts constraints on the mean-payoff function only in a small vicinity of its global maxima;
we then propose another algorithm, called local-HOO, which is proven to enjoy a regret again
essentially similar to the one of the basic version (Section 4.4). Finally, we prove the minimax
optimality of HOO in metric spaces (Section 4.5).

4. In Section 5 we compare the results of this paper with previous works.

2 Problem setup 

A stochastic bandit problem $\mathcal{B}$ is a pair $\mathcal{B} = (\mathcal{X}, M)$, where $\mathcal{X}$ is a measurable space of arms
and $M$ determines the distribution of rewards associated with each arm. We say that $M$ is a
bandit environment on $\mathcal{X}$. Formally, $M$ is a mapping $\mathcal{X} \to \mathcal{M}_1(\mathbb{R})$, where $\mathcal{M}_1(\mathbb{R})$ is the space of
probability distributions over the reals. The distribution assigned to arm $x \in \mathcal{X}$ is denoted by $M_x$.
We require that for each arm $x \in \mathcal{X}$, the distribution $M_x$ admits a first-order moment; we then
denote by $f(x)$ its expectation ("mean payoff"),

$$f(x) = \int y \, dM_x(y).$$

The mean-payoff function $f$ thus defined is assumed to be measurable. For simplicity, we shall also
assume that all $M_x$ have bounded supports, included in some fixed bounded interval^, say, the unit
interval [0, 1]. Then, $f$ also takes bounded values, in [0, 1].

A decision maker (the gambler of the introduction) that interacts with a stochastic bandit problem
$\mathcal{B}$ plays a game at discrete time steps according to the following rules. In the first round the
decision maker can select an arm $X_1 \in \mathcal{X}$ and receives a reward $Y_1$ drawn at random from $M_{X_1}$. In
round $n > 1$ the decision maker can select an arm $X_n \in \mathcal{X}$ based on the information available up to
time $n$, i.e., $(X_1, Y_1, \ldots, X_{n-1}, Y_{n-1})$, and receives a reward $Y_n$ drawn from $M_{X_n}$, independently of
$(X_1, Y_1, \ldots, X_{n-1}, Y_{n-1})$ given $X_n$. Note that a decision maker may randomize his choice, but can
only use information available up to the point in time when the choice is made.

^More generally, our results would also hold when the tails of the reward distributions are uniformly sub-Gaussian. 
Formally, a strategy of the decision maker in this game ("bandit strategy") can be described by
an infinite sequence of measurable mappings, $\varphi = (\varphi_1, \varphi_2, \ldots)$, where $\varphi_n$ maps the space of past
observations,

$$\mathcal{H}_n = \bigl(\mathcal{X} \times [0,1]\bigr)^{n-1},$$

to the space of probability measures over $\mathcal{X}$. By convention, $\varphi_1$ does not take any argument. A
strategy is called deterministic if for every $n$, $\varphi_n$ takes only Dirac distributions as values.

The goal of the decision maker is to maximize his expected cumulative reward. Equivalently, the
goal can be expressed as minimizing the expected cumulative regret, which is defined as follows. Let

$$f^* = \sup_{x \in \mathcal{X}} f(x)$$

be the best expected payoff in a single round. At round $n$, the cumulative regret of a decision maker
playing $\mathcal{B}$ is

$$\widehat{R}_n = n f^* - \sum_{t=1}^{n} Y_t,$$

i.e., the difference between the maximum expected payoff in $n$ rounds and the actual total payoff.
In the sequel, we shall restrict our attention to the expected cumulative regret, which is defined as
the expectation $\mathbb{E}[\widehat{R}_n]$ of the cumulative regret $\widehat{R}_n$.
Finally, we define the cumulative pseudo-regret as

$$R_n = n f^* - \sum_{t=1}^{n} f(X_t),$$

that is, the actual rewards used in the definition of the regret are replaced by the mean-payoffs of
the arms pulled. Since (by the tower rule)

$$\mathbb{E}[Y_t] = \mathbb{E}\bigl[\mathbb{E}[Y_t \mid X_t]\bigr] = \mathbb{E}[f(X_t)],$$

the expected values $\mathbb{E}[\widehat{R}_n]$ of the cumulative regret and $\mathbb{E}[R_n]$ of the cumulative pseudo-regret are
the same. Thus, we focus below on the study of the behavior of $\mathbb{E}[R_n]$.
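As a small illustration of the two regret notions, the following Python sketch computes the cumulative regret and the cumulative pseudo-regret for a toy problem. Everything specific here is an assumption made for the example and is not taken from the paper: the mean-payoff function `mean_payoff` (with $f^* = 1$ attained at $x = 0.5$), the Bernoulli rewards, and the uniformly random (non-adaptive) strategy in `simulate`.

```python
import random

def mean_payoff(x):
    # Hypothetical mean-payoff function on [0, 1]; its maximum f* = 1 is at x = 0.5.
    return 1.0 - abs(x - 0.5)

def regrets(f_star, f, arms, rewards):
    """Return (cumulative regret, cumulative pseudo-regret) after len(arms) rounds."""
    n = len(arms)
    regret = n * f_star - sum(rewards)                    # uses the actual rewards Y_t
    pseudo_regret = n * f_star - sum(f(x) for x in arms)  # uses the mean payoffs f(X_t)
    return regret, pseudo_regret

def simulate(n, seed):
    """Play n rounds with arms drawn uniformly at random and Bernoulli(f(x)) rewards."""
    rng = random.Random(seed)
    arms = [rng.random() for _ in range(n)]
    rewards = [1.0 if rng.random() < mean_payoff(x) else 0.0 for x in arms]
    return regrets(1.0, mean_payoff, arms, rewards)
```

On a single run the two quantities differ because of the reward noise; averaged over many independent runs they should be close, in line with the tower-rule argument above.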

Remark 1 As it is argued in Q/, in many real-world problems, the decision maker is not interested 
in his cumulative regret but rather in his simple regret. The latter can be defined as follows. After $n$
rounds of play in a stochastic bandit problem $\mathcal{B}$, the decision maker is asked to make a recommendation
$Z_n \in \mathcal{X}$ based on the $n$ obtained rewards $Y_1, \ldots, Y_n$. The simple regret of this recommendation
equals

$$r_n = f^* - f(Z_n).$$

In this paper we focus on the cumulative regret $R_n$, but all the results can be readily extended to the
simple regret by considering the recommendation $Z_n = X_{T_n}$, where $T_n$ is drawn uniformly at random
in $\{1, \ldots, n\}$. Indeed, in this case,

$$\mathbb{E}[r_n] \le \frac{\mathbb{E}[R_n]}{n},$$

as is shown in l^. Section 3]. 

3 The Hierarchical Optimistic Optimization (HOO) strategy 

The HOO strategy (cf. Algorithm 1) incrementally builds an estimate of the mean-payoff function
$f$ over $\mathcal{X}$. The core idea (as in previous works) is to estimate $f$ precisely around its maxima, while
estimating it loosely in other parts of the space $\mathcal{X}$. To implement this idea, HOO maintains a binary
tree whose nodes are associated with measurable regions of the arm-space $\mathcal{X}$ such that the regions
associated with nodes deeper in the tree (further away from the root) represent increasingly smaller
subsets of $\mathcal{X}$. The tree is built in an incremental manner. At each node of the tree, HOO stores some
statistics based on the information received in previous rounds. In particular, HOO keeps track of
the number of times a node was traversed up to round $n$ and the corresponding empirical average
of the rewards received so far. Based on these, HOO assigns an optimistic estimate (denoted by $B$)
to the maximum mean-payoff associated with each node. These estimates are then used to select
the next node to "play". This is done by traversing the tree, beginning from the root, and always
following the node with the highest $B$-value (cf. lines 4-14 of Algorithm 1). Once a node is selected,
a point in the region associated with it is chosen (line 16) and is sent to the environment. Based on
the point selected and the received reward, the tree is updated (lines 18-33).

The tree of coverings which HOO needs to receive as an input is an infinite binary tree whose
nodes are associated with subsets of $\mathcal{X}$. The nodes in this tree are indexed by pairs of integers $(h, i)$;
node $(h, i)$ is located at depth $h \ge 0$ from the root. The range of the second index, $i$, associated
with nodes at depth $h$ is restricted by $1 \le i \le 2^h$. Thus, the root node is denoted by (0, 1). By
convention, $(h+1, 2i-1)$ and $(h+1, 2i)$ are used to refer to the two children of the node $(h, i)$. Let
$\mathcal{P}_{h,i} \subset \mathcal{X}$ be the region associated with node $(h, i)$. By assumption, these regions are measurable
and must satisfy the constraints

$$\mathcal{P}_{0,1} = \mathcal{X}, \tag{1a}$$
$$\mathcal{P}_{h,i} = \mathcal{P}_{h+1,2i-1} \cup \mathcal{P}_{h+1,2i}, \quad \text{for all } h \ge 0 \text{ and } 1 \le i \le 2^h. \tag{1b}$$

As a corollary, the regions $\mathcal{P}_{h,i}$ at any level $h \ge 0$ cover the space $\mathcal{X}$,

$$\mathcal{X} = \bigcup_{i=1}^{2^h} \mathcal{P}_{h,i},$$

explaining the term "tree of coverings".
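To make the indexing concrete, here is a minimal Python sketch of one possible tree of coverings: the dyadic intervals of $[0, 1]$ (the same choice as in the experiment reported in Figure 2). The function names `region` and `children` are hypothetical helpers introduced for this illustration only.

```python
from fractions import Fraction

def region(h, i):
    """Closed dyadic interval attached to node (h, i), with 1 <= i <= 2**h."""
    width = Fraction(1, 2 ** h)
    return ((i - 1) * width, i * width)

def children(h, i):
    """Indices of the two children of node (h, i)."""
    return (h + 1, 2 * i - 1), (h + 1, 2 * i)
```

With this choice, the root region is the whole interval, as required by (1a), and each region is the union of the regions of its two children, as required by (1b): the left child covers the left half and the right child covers the right half.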

In the algorithm listing the recursive computation of the $B$-values (lines 28-33) makes a local
copy of the tree; of course, this part of the algorithm could be implemented in various other ways.
Other arbitrary choices in the algorithm as shown here are how tie breaking in the node selection
part is done (lines 9-12), or how a point in the region associated with the selected node is chosen
(line 16). We note in passing that implementing these differently would not change our theoretical
results.

To facilitate the formal study of the algorithm, we shall need some more notation. In particular,
we shall introduce time-indexed versions ($\mathcal{T}_n$, $(H_n, I_n)$, $X_n$, $Y_n$, $\widehat{\mu}_{h,i}(n)$, etc.) of the quantities used
by the algorithm. The convention used is that the indexation by $n$ is used to indicate the value
taken at the end of the $n$th round.

In particular, $\mathcal{T}_n$ is used to denote the finite subtree stored by the algorithm at the end of round
$n$. Thus, the initial tree is $\mathcal{T}_0 = \{(0, 1)\}$ and it is expanded round after round as

$$\mathcal{T}_n = \mathcal{T}_{n-1} \cup \{(H_n, I_n)\},$$

where $(H_n, I_n)$ is the node selected in line 15. We call $(H_n, I_n)$ the node played in round $n$. We use
$X_n$ to denote the point selected by HOO in the region associated with the node played in round $n$,
while $Y_n$ denotes the received reward.

Node selection works by comparing $B$-values and always choosing the node with the highest
$B$-value. The $B$-value, $B_{h,i}(n)$, at node $(h, i)$ by the end of round $n$ is an estimated upper bound
on the mean-payoff function at node $(h, i)$. To define it we first need to introduce the average of the
Algorithm 1 The HOO strategy

Parameters: Two real numbers $\nu_1 > 0$ and $\rho \in (0, 1)$, a sequence $(\mathcal{P}_{h,i})_{h \ge 0,\, 1 \le i \le 2^h}$ of subsets of
$\mathcal{X}$ satisfying the conditions (1a) and (1b).

Auxiliary function Leaf($\mathcal{T}$): outputs a leaf of $\mathcal{T}$.

Initialization: $\mathcal{T} = \{(0, 1)\}$ and $B_{1,1} = B_{1,2} = +\infty$.

 1: for $n = 1, 2, \ldots$ do                                  > Strategy HOO in round $n \ge 1$
 2:   $(h, i) \leftarrow (0, 1)$                                > Start at the root
 3:   $P \leftarrow \{(h, i)\}$                                 > $P$ stores the path traversed in the tree
 4:   while $(h, i) \in \mathcal{T}$ do                         > Search the tree $\mathcal{T}$
 5:     if $B_{h+1,2i-1} > B_{h+1,2i}$ then                     > Select the "more promising" child
 6:       $(h, i) \leftarrow (h+1, 2i-1)$
 7:     else if $B_{h+1,2i-1} < B_{h+1,2i}$ then
 8:       $(h, i) \leftarrow (h+1, 2i)$
 9:     else                                                    > Tie-breaking rule
10:       $Z \sim \mathrm{Ber}(0.5)$                            > e.g., choose a child at random
11:       $(h, i) \leftarrow (h+1, 2i-Z)$
12:     end if
13:     $P \leftarrow P \cup \{(h, i)\}$
14:   end while
15:   $(H, I) \leftarrow (h, i)$                                > The selected node
16:   Choose arm $X$ in $\mathcal{P}_{H,I}$ and play it         > Arbitrary selection of an arm
17:   Receive corresponding reward $Y$
18:   $\mathcal{T} \leftarrow \mathcal{T} \cup \{(H, I)\}$      > Extend the tree
19:   for all $(h, i) \in P$ do                                 > Update the statistics $T$ and $\widehat{\mu}$ stored in the path
20:     $T_{h,i} \leftarrow T_{h,i} + 1$                        > Increment the counter of node $(h, i)$
21:     $\widehat{\mu}_{h,i} \leftarrow (1 - 1/T_{h,i})\,\widehat{\mu}_{h,i} + Y/T_{h,i}$   > Update the mean $\widehat{\mu}_{h,i}$ of node $(h, i)$
22:   end for
23:   for all $(h, i) \in \mathcal{T}$ do                       > Update the statistics $U$ stored in the tree
24:     $U_{h,i} \leftarrow \widehat{\mu}_{h,i} + \sqrt{(2 \ln n)/T_{h,i}} + \nu_1 \rho^h$   > Update the $U$-value of node $(h, i)$
25:   end for
26:   $B_{H+1,2I-1} \leftarrow +\infty$                         > $B$-values of the children of the new leaf
27:   $B_{H+1,2I} \leftarrow +\infty$
28:   $\mathcal{T}' \leftarrow \mathcal{T}$                     > Local copy of the current tree $\mathcal{T}$
29:   while $\mathcal{T}' \ne \{(0, 1)\}$ do                    > Backward computation of the $B$-values
30:     $(h, i) \leftarrow \mathrm{Leaf}(\mathcal{T}')$         > Take any remaining leaf
31:     $B_{h,i} \leftarrow \min\bigl\{U_{h,i},\ \max\{B_{h+1,2i-1}, B_{h+1,2i}\}\bigr\}$   > Backward computation
32:     $\mathcal{T}' \leftarrow \mathcal{T}' \setminus \{(h, i)\}$   > Drop updated leaf $(h, i)$
33:   end while
34: end for
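The listing above can be sketched in Python as follows. This is only an illustrative sketch under stated assumptions, not the authors' implementation: the arm space is taken to be $[0, 1]$ with the dyadic intervals as tree of coverings, the arm played in a node is the center of its interval, ties are broken at random, and the names `hoo`, `sample_reward`, `nu1`, `rho` and `rng` are hypothetical. Instead of the local copy and the Leaf function, the $B$-values are recomputed by visiting the nodes from the deepest level upwards (which also refreshes the root's $B$-value, a harmless extra step).

```python
import math
import random

def hoo(n_rounds, sample_reward, nu1=1.0, rho=0.5, rng=None):
    """Run a HOO-style strategy on [0, 1] with dyadic intervals; return the arms played."""
    rng = rng or random.Random(0)
    tree = {(0, 1)}          # the tree T, as a set of nodes (h, i)
    count, mean = {}, {}     # counters T_{h,i} and empirical means
    B = {}                   # B-values; nodes missing from the dict count as +infinity
    arms = []
    for n in range(1, n_rounds + 1):
        h, i = 0, 1
        path = [(h, i)]
        while (h, i) in tree:                         # lines 4-14: follow the larger B-value
            left = B.get((h + 1, 2 * i - 1), math.inf)
            right = B.get((h + 1, 2 * i), math.inf)
            if left > right:
                h, i = h + 1, 2 * i - 1
            elif left < right:
                h, i = h + 1, 2 * i
            else:                                     # tie: pick a child at random
                h, i = h + 1, 2 * i - rng.randint(0, 1)
            path.append((h, i))
        x = (i - 0.5) / 2 ** h                        # line 16: center of the selected interval
        y = sample_reward(x)                          # line 17
        arms.append(x)
        tree.add((h, i))                              # line 18
        for node in path:                             # lines 19-22: update counts and means
            count[node] = count.get(node, 0) + 1
            mean[node] = mean.get(node, 0.0) + (y - mean.get(node, 0.0)) / count[node]
        # lines 23-33: U-values, then B-values from the deepest nodes upwards
        for (hh, ii) in sorted(tree, key=lambda node: -node[0]):
            u = mean[(hh, ii)] + math.sqrt(2 * math.log(n) / count[(hh, ii)]) + nu1 * rho ** hh
            child_max = max(B.get((hh + 1, 2 * ii - 1), math.inf),
                            B.get((hh + 1, 2 * ii), math.inf))
            B[(hh, ii)] = min(u, child_max)
    return arms
```

For instance, `hoo(1000, lambda x: 1.0 - abs(x - 0.3))` runs the sketch on a noise-free, hypothetical mean-payoff function. Sorting the nodes at every round is a simplification; it keeps the per-round cost at least linear in the size of the tree, like the basic algorithm.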
rewards received in rounds when some descendant of node $(h, i)$ was chosen (by convention, each
node is a descendant of itself):

$$\widehat{\mu}_{h,i}(n) = \frac{1}{T_{h,i}(n)} \sum_{t=1}^{n} Y_t \, \mathbb{1}_{\{(H_t, I_t) \in \mathcal{C}(h,i)\}}.$$

Here, $\mathcal{C}(h, i)$ denotes the set of all descendants of a node $(h, i)$ in the infinite tree,

$$\mathcal{C}(h, i) = \{(h, i)\} \cup \mathcal{C}(h+1, 2i-1) \cup \mathcal{C}(h+1, 2i),$$

and $T_{h,i}(n)$ is the number of times a descendant of $(h, i)$ is played up to and including round $n$, that
is,

$$T_{h,i}(n) = \sum_{t=1}^{n} \mathbb{1}_{\{(H_t, I_t) \in \mathcal{C}(h,i)\}}.$$

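A key quantity determining $B_{h,i}(n)$ is $U_{h,i}(n)$, an initial estimate of the maximum of the mean-payoff
function in the region $\mathcal{P}_{h,i}$ associated with node $(h, i)$:

$$U_{h,i}(n) = \begin{cases} \widehat{\mu}_{h,i}(n) + \sqrt{\dfrac{2 \ln n}{T_{h,i}(n)}} + \nu_1 \rho^h, & \text{if } T_{h,i}(n) > 0; \\[2mm] +\infty, & \text{otherwise.} \end{cases} \tag{2}$$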



In the expression corresponding to the case $T_{h,i}(n) > 0$, the first term added to the average of
rewards accounts for the uncertainty arising from the randomness of the rewards that the average
is based on, while the second term, $\nu_1 \rho^h$, accounts for the maximum possible variation of the
mean-payoff function over the region $\mathcal{P}_{h,i}$. The actual bound on the maxima used in HOO is defined
recursively by

$$B_{h,i}(n) = \begin{cases} \min\bigl\{U_{h,i}(n),\ \max\{B_{h+1,2i-1}(n),\, B_{h+1,2i}(n)\}\bigr\}, & \text{if } (h, i) \in \mathcal{T}_n; \\[2mm] +\infty, & \text{otherwise.} \end{cases}$$



The role of $B_{h,i}(n)$ is to put a tight, optimistic, high-probability upper bound on the best
mean-payoff that can be achieved in the region $\mathcal{P}_{h,i}$. By assumption, $\mathcal{P}_{h,i} = \mathcal{P}_{h+1,2i-1} \cup \mathcal{P}_{h+1,2i}$. Thus,
assuming that $B_{h+1,2i-1}(n)$ (resp., $B_{h+1,2i}(n)$) is a valid upper bound for region $\mathcal{P}_{h+1,2i-1}$ (resp.,
$\mathcal{P}_{h+1,2i}$), we see that $\max\{B_{h+1,2i-1}(n), B_{h+1,2i}(n)\}$ must be a valid upper bound for region $\mathcal{P}_{h,i}$.
Since $U_{h,i}(n)$ is another valid upper bound for region $\mathcal{P}_{h,i}$, we get a tighter (less overoptimistic)
upper bound by taking the minimum of these bounds.

Obviously, for leaves $(h, i)$ of the tree $\mathcal{T}_n$, one has $B_{h,i}(n) = U_{h,i}(n)$, while close to the root one
may expect that $B_{h,i}(n) < U_{h,i}(n)$; that is, the upper bounds close to the root are expected to be
less biased than the ones associated with nodes farther away from the root.

Note that at the beginning of round $n$, the algorithm uses $B_{h,i}(n-1)$ to select the node $(H_n, I_n)$
to be played (since $B_{h,i}(n)$ will only be available at the end of round $n$). It does so by following
a path from the root node to an inner node with only one child or a leaf and finally considering a
child $(H_n, I_n)$ of the latter; at each node of the path, the child with highest $B$-value is chosen, till
the node $(H_n, I_n)$ with infinite $B$-value is reached.



Illustrations. Figure 1 illustrates the computation done by HOO in round $n$, as well as the
correspondence between the nodes of the tree constructed by the algorithm and their associated
regions. Figure 2 shows trees built by running HOO for a specific environment.

Figure 1: Illustration of the node selection procedure in round $n$. The tree represents $\mathcal{T}_n$. In
the illustration, $B_{h+1,2i-1}(n-1) > B_{h+1,2i}(n-1)$; therefore, the selected path included the node
$(h+1, 2i-1)$ rather than the node $(h+1, 2i)$.



Computational complexity. At the end of round $n$, the size of the active tree $\mathcal{T}_n$ is at most $n$,
making the storage requirements of HOO linear in $n$. In addition, the statistics and $B$-values of
all nodes in the active tree need to be updated, which thus takes time $O(n)$. HOO runs in time
$O(n)$ at each round $n$, making the algorithm's total running time up to round $n$ quadratic in $n$. In
Section 4.3 we modify HOO so that if the time horizon $n_0$ is known in advance, the total running
time is $O(n_0 \ln n_0)$, while the modified algorithm will be shown to enjoy essentially the same regret
bound as the original version.

4 Main results 

We start by describing and commenting on the assumptions that we need to analyze the regret of 
HOO. This is followed by stating the first upper bound, followed by some improvements on the basic 
algorithm. The section is finished by the statement of our results on the minimax optimality of 
HOO. 

4.1 Assumptions 

The main assumption will concern the "smoothness" of the mean-payoff function. However, some- 
what unconventionally, we shall use a notion of smoothness that is built around dissimilarity func- 
tions rather than distances, allowing us to deal with function classes of highly different smoothness 
degrees in a unified manner. Before stating our smoothness assumptions, we define the notion of a 
dissimilarity function and some associated concepts. 

Definition 2 (Dissimilarity) A dissimilarity $\ell$ over $\mathcal{X}$ is a non-negative mapping $\ell : \mathcal{X}^2 \to \mathbb{R}$
satisfying $\ell(x, x) = 0$ for all $x \in \mathcal{X}$.

Figure 2: The trees (bottom figures) built by HOO after 1,000 (left) and 10,000 (right) rounds. The
mean-payoff function (shown in the top part of the figure) is $x \in [0, 1] \mapsto \tfrac{1}{2}\bigl(\sin(13x)\sin(27x) + 1\bigr)$;
the corresponding payoffs are Bernoulli-distributed. The inputs of HOO are as follows: the tree of
coverings is formed by all dyadic intervals, $\nu_1 = 1$ and $\rho = 1/2$. The tie-breaking rule is to choose
a child at random (as shown in Algorithm 1), while the points in $\mathcal{X}$ to be played are chosen as
the centers of the dyadic intervals. Note that the tree is extensively refined where the mean-payoff
function is near-optimal, while it is much less developed in other regions.



Given a dissimilarity $\ell$, the diameter of a subset $A$ of $\mathcal{X}$ as measured by $\ell$ is defined by

$$\mathrm{diam}(A) = \sup_{x, y \in A} \ell(x, y),$$

while the $\ell$-open ball of $\mathcal{X}$ with radius $\varepsilon > 0$ and center $x \in \mathcal{X}$ is defined by

$$\mathcal{B}(x, \varepsilon) = \{ y \in \mathcal{X} : \ell(x, y) < \varepsilon \}.$$

Note that the dissimilarity $\ell$ is only used in the theoretical analysis of HOO; the algorithm does not
require $\ell$ as an explicit input. However, when choosing its parameters (the tree of coverings and the
real numbers $\nu_1 > 0$ and $\rho < 1$) for the (set of) two assumptions below to be satisfied, the user of
the algorithm probably has in mind a given dissimilarity.

However, it is also natural to wonder what is the class of functions for which the algorithm (given 
a fixed tree) can achieve non-trivial regret bounds; a similar question for regression was investigated 
e.g., by Yang [ll]. We shall indicate below how to construct a subset of such a class, right after 
stating our assumptions connecting the tree, the dissimilarity, and the environment (the mean-payoff 
function). Of these, Assumption A2 will be interpreted, discussed, and equivalently reformulated
below into (6), a form that might be more intuitive. The form (3) stated below will turn out to be
the most useful one in the proofs. 

Assumptions Given the parameters of HOO, that is, the real numbers $\nu_1 > 0$ and $\rho \in (0, 1)$
and the tree of coverings $(\mathcal{P}_{h,i})$, there exists a dissimilarity function $\ell$ such that the following two
assumptions are satisfied.

A1. There exists $\nu_2 > 0$ such that for all integers $h \ge 0$,

(a) $\mathrm{diam}(\mathcal{P}_{h,i}) \le \nu_1 \rho^h$ for all $i = 1, \ldots, 2^h$;
(b) for all $i = 1, \ldots, 2^h$, there exists $x^{\circ}_{h,i} \in \mathcal{P}_{h,i}$ such that
$\mathcal{B}_{h,i} = \mathcal{B}(x^{\circ}_{h,i}, \nu_2 \rho^h) \subset \mathcal{P}_{h,i}$;
(c) $\mathcal{B}_{h,i} \cap \mathcal{B}_{h,j} = \emptyset$ for all $1 \le i < j \le 2^h$.

A2. The mean-payoff function $f$ satisfies that for all $x, y \in \mathcal{X}$,

$$f^* - f(y) \le f^* - f(x) + \max\bigl\{ f^* - f(x),\ \ell(x, y) \bigr\}. \tag{3}$$



We show next how a tree induces in a natural way first a dissimilarity and then a class of
environments. For this, we need to assume that the tree of coverings $(\mathcal{P}_{h,i})$, in addition to (1a)
and (1b), is such that the subsets $\mathcal{P}_{h,i}$ and $\mathcal{P}_{h,j}$ are disjoint whenever $1 \le i < j \le 2^h$ and that
none of them is empty. Then, each $x \in \mathcal{X}$ corresponds to a unique path in the tree, which can be
represented as an infinite binary sequence $x_0 x_1 x_2 \ldots$, where

$$x_0 = \mathbb{1}_{\{x \in \mathcal{P}_{1,\,1+1}\}},$$
$$x_1 = \mathbb{1}_{\{x \in \mathcal{P}_{2,\,1+(2x_0+1)}\}},$$
$$x_2 = \mathbb{1}_{\{x \in \mathcal{P}_{3,\,1+(4x_0+2x_1+1)}\}},$$
$$\ldots$$

For points $x, y \in \mathcal{X}$ with respective representations $x_0 x_1 \ldots$ and $y_0 y_1 \ldots$, we let

$$\ell(x, y) = (1 - \rho)\,\nu_1 \sum_{h=0}^{\infty} \mathbb{1}_{\{x_h \ne y_h\}}\, \rho^h.$$

It is not hard to see that this dissimilarity satisfies A1. Thus, the associated class of environments $\mathcal{C}$
is formed by those with mean-payoff functions satisfying A2 with the so-defined dissimilarity. This
is a "natural class" underlying the tree for which our tree-based algorithm can achieve non-trivial
regret. (However, we do not know if this is the largest such class.)
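The construction can be tried out numerically. The Python sketch below assumes, for illustration only, that the arm space is $[0, 1)$ and that the tree of coverings is made of half-open dyadic intervals, so that the branch indicators of $x$ are simply its binary digits; the infinite sum is truncated at a finite `depth`. The names `path_bits` and `tree_dissimilarity` are hypothetical.

```python
def path_bits(x, depth):
    """First `depth` branch indicators (1 = right child) of x in [0, 1) in the dyadic tree."""
    bits = []
    for _ in range(depth):
        x *= 2
        bit = int(x >= 1)
        bits.append(bit)
        x -= bit
    return bits

def tree_dissimilarity(x, y, nu1=1.0, rho=0.5, depth=40):
    """Tree-induced dissimilarity, with the sum over h truncated after `depth` terms."""
    xb, yb = path_bits(x, depth), path_bits(y, depth)
    return (1 - rho) * nu1 * sum(rho ** h for h in range(depth) if xb[h] != yb[h])
```

Two points that share their first $h$ branch indicators lie in the same region at depth $h$ and are at dissimilarity at most $\nu_1 \rho^h$, in agreement with A1(a). Conversely, two points that are close in the usual distance but fall on different sides of the first split (such as 0.49 and 0.51 here) are at dissimilarity at least $(1 - \rho)\nu_1$.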

In general, Assumption A1 ensures that the regions in the tree of coverings $(\mathcal{P}_{h,i})$ shrink exactly
at a geometric rate. The following example shows how to satisfy A1 when the domain $\mathcal{X}$ is a
$D$-dimensional hyper-rectangle and the dissimilarity is some positive power of the Euclidean (or
supremum) norm.

Example 1 Assume that $\mathcal{X}$ is a $D$-dimensional hyper-rectangle and consider the dissimilarity
$\ell(x, y) = b \|x - y\|_2^a$, where $a > 0$ and $b > 0$ are real numbers and $\|\cdot\|_2$ is the Euclidean norm. Define the tree
of coverings $(\mathcal{P}_{h,i})$ in the following inductive way: let $\mathcal{P}_{0,1} = \mathcal{X}$. Given a node $\mathcal{P}_{h,i}$, let $\mathcal{P}_{h+1,2i-1}$
and $\mathcal{P}_{h+1,2i}$ be obtained from the hyper-rectangle $\mathcal{P}_{h,i}$ by splitting it in the middle along its longest
side (ties can be broken arbitrarily).

We now argue that Assumption A1 is satisfied. With no loss of generality we take $\mathcal{X} = [0, 1]^D$.
Then, for all integers $u \ge 0$ and $0 \le k \le D - 1$,

$$\mathrm{diam}(\mathcal{P}_{uD+k,1}) = b \left( \frac{1}{2^u} \sqrt{D - \frac{3}{4} k} \right)^{a} .$$

It is now easy to see that Assumption A1 is satisfied for the indicated dissimilarity, e.g., with the
choice of the parameters $\rho = 2^{-a/D}$ and $\nu_1 = b\,(2\sqrt{D})^a$ for HOO, and the value $\nu_2 = b/2^a$.
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Example 2 In the same setting, with the same tree of coverings [Vh.i) over X = [0, 1]^, but now 
with the dissimilarity i{x,y) — b\\x — y\\'^, we get that for all integers and ^ k ^ D — 1, 

diam(7'„Z3+A;a) = b 

This time, Assumption \Al\ is satisfied, e.g., with the choice of the parameters p = 2^°"^-^ and 
1/1=52'' for HOO, and the value V2 = 6/2". 
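
The diameter computations of Examples 1 and 2 can be checked numerically. The sketch below builds the side lengths of a cell after repeated midpoint splits of $[0,1]^D$ and compares its diameter with $\nu_1\rho^h$ for the parameter choices quoted above. It only checks the diameter part of Assumption A1 (not the inscribed balls involving $\nu_2$); the function names and the tie-breaking rule (lowest coordinate first) are assumptions of this illustration.

```python
import math

def cell_sides(depth, dim):
    """Side lengths of a depth-`depth` cell of [0, 1]^dim obtained by repeatedly
    halving a longest side (ties broken by lowest coordinate index)."""
    sides = [1.0] * dim
    for _ in range(depth):
        j = max(range(dim), key=lambda i: sides[i])  # max() returns the first maximiser
        sides[j] /= 2.0
    return sides

def diameter(sides, a, b, norm="euclidean"):
    """Diameter of the cell under l(x, y) = b * ||x - y||**a."""
    if norm == "euclidean":
        width = math.sqrt(sum(s * s for s in sides))
    elif norm == "sup":
        width = max(sides)
    else:
        raise ValueError("norm must be 'euclidean' or 'sup'")
    return b * width ** a

def check_a1_diameter(dim, a, b, norm, max_depth=40):
    """Check diam(P_{h,i}) <= nu1 * rho**h for h = 0..max_depth, with the
    parameter choices quoted in Examples 1 and 2."""
    rho = 2.0 ** (-a / dim)
    nu1 = b * (2.0 * math.sqrt(dim)) ** a if norm == "euclidean" else b * 2.0 ** a
    return all(
        diameter(cell_sides(h, dim), a, b, norm) <= nu1 * rho ** h
        for h in range(max_depth + 1)
    )
```

For instance, `check_a1_diameter(3, 2.0, 1.5, "euclidean")` inspects depths 0 to 40 in dimension 3 with $a = 2$ and $b = 1.5$.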

The second assumption, A2, concerns the environment; when Assumption A2 is satisfied, we say
that $f$ is weakly Lipschitz with respect to (w.r.t.) $\ell$. The choice of this terminology follows from the
fact that if $f$ is 1-Lipschitz w.r.t. $\ell$, i.e., for all $x, y \in \mathcal{X}$, one has $|f(x) - f(y)| \leq \ell(x,y)$, then it is
also weakly Lipschitz w.r.t. $\ell$.

On the other hand, weak Lipschitzness is a milder requirement. It implies local (one-sided)
1-Lipschitzness at any global maximum, since at any arm $x^*$ such that $f(x^*) = f^*$, the criterion (3)
rewrites to $f(x^*) - f(y) \leq \ell(x^*, y)$. In the vicinity of other arms $x$, the constraint is milder as the
arm $x$ gets worse (as $f^* - f(x)$ increases) since the condition (3) rewrites to

\[
\forall y \in \mathcal{X}, \qquad f(x) - f(y) \leq \max\bigl\{ f^* - f(x), \, \ell(x,y) \bigr\} . \tag{4}
\]
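
As an illustration, the weak Lipschitz condition can be tested by brute force on a finite set of points, in the form $f^* - f(y) \leq f^* - f(x) + \max\{f^* - f(x), \ell(x,y)\}$, which is equivalent to (4). The checker below is a hypothetical helper, not part of HOO; passing on a grid does not prove the property on the whole domain, and the tolerance `tol` is an assumption added to absorb floating-point error.

```python
def is_weakly_lipschitz(f, ell, points, f_star, tol=1e-12):
    """Check f* - f(y) <= (f* - f(x)) + max(f* - f(x), ell(x, y)) for all pairs
    (x, y) taken from the finite collection `points`."""
    for x in points:
        gap_x = f_star - f(x)
        for y in points:
            if f_star - f(y) > gap_x + max(gap_x, ell(x, y)) + tol:
                return False
    return True

# Illustrative data (assumptions): a grid on [0, 1] and the distance |x - y|.
grid = [i / 100 for i in range(101)]
dist = lambda x, y: abs(x - y)
tent = lambda x: 1.0 - abs(x - 0.5)          # 1-Lipschitz, hence weakly Lipschitz
steep = lambda x: 1.0 - 3.0 * abs(x - 0.5)   # too steep at its maximum
```

Here `tent` passes the check because it is 1-Lipschitz, while `steep` fails already at the pair $x = x^* = 0.5$, $y = 0$, where one-sided 1-Lipschitzness at the maximum is violated.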

Here is another interpretation of these two facts; it will be useful when considering local assumptions
in Section 4.4 (a weaker set of assumptions). First, concerning the behavior around global
maxima, Assumption A2 implies that for any set $A \subset \mathcal{X}$ with $\sup_{x \in A} f(x) = f^*$,

\[
f^* - \inf_{x \in A} f(x) \leq \mathrm{diam}(A) . \tag{5}
\]

Second, it can be seen that Assumption A2 is equivalent (see the footnote) to the following property: for all $x \in \mathcal{X}$
and $\varepsilon \geq 0$,

\[
\mathcal{B}\bigl(x, \, f^* - f(x) + \varepsilon\bigr) \subset \mathcal{X}_{2(f^* - f(x)) + \varepsilon} \tag{6}
\]

where

\[
\mathcal{X}_\varepsilon = \bigl\{ x \in \mathcal{X} : f(x) \geq f^* - \varepsilon \bigr\}
\]

denotes the set of $\varepsilon$-optimal arms. This second property essentially states that there is no sudden
and large drop in the mean-payoff function around the global maxima (note that this property can
be satisfied even for discontinuous functions).

Figure 3 presents an illustration of the two properties discussed above.

Before stating our main results, we provide a straightforward, though useful consequence of
Assumptions A1 and A2, which should be seen as an intuitive justification for the third term in (2).
For all nodes $(h,i)$, let

\[
f^*_{h,i} = \sup_{x \in \mathcal{P}_{h,i}} f(x) \qquad \text{and} \qquad \Delta_{h,i} = f^* - f^*_{h,i} .
\]

$\Delta_{h,i}$ is called the suboptimality factor of node $(h,i)$. Depending on whether it is positive or not, a node
$(h,i)$ is called suboptimal ($\Delta_{h,i} > 0$) or optimal ($\Delta_{h,i} = 0$).

Footnote. That Assumption A2 implies (6) is immediate; for the converse, it suffices to consider, for each $y \in \mathcal{X}$, the sequence

\[
\varepsilon_n = \bigl( \ell(x,y) - (f^* - f(x)) \bigr)_+ + 1/n ,
\]
where $(\,\cdot\,)_+$ denotes the nonnegative part.

Figure 3: Illustration of the property of weak Lipschitzness (on the real line and for the distance
$\ell(x,y) = |x - y|$). Around the optimum $x^*$ the values $f(y)$ should be above $f^* - \ell(x^*, y)$. Around
any $\varepsilon$-optimal point $x$ the values $f(y)$ should be larger than $f^* - 2\varepsilon$ for $\ell(x,y) \leq \varepsilon$ and larger than
$f(x) - \ell(x,y)$ elsewhere.



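Lemma 3 Under Assumptions A1 and A2, if the suboptimality factor $\Delta_{h,i}$ of a region $\mathcal{P}_{h,i}$ is
bounded by $c\,\nu_1\rho^h$ for some $c \geq 0$, then all arms in $\mathcal{P}_{h,i}$ are $\max\{2c, c+1\}\,\nu_1\rho^h$-optimal, that is,

\[
\mathcal{P}_{h,i} \subset \mathcal{X}_{\max\{2c,\,c+1\}\,\nu_1\rho^h} .
\]

Proof For all $\delta > 0$, we denote by $x^*_{h,i}(\delta)$ an element of $\mathcal{P}_{h,i}$ such that

\[
f\bigl(x^*_{h,i}(\delta)\bigr) \geq f^*_{h,i} - \delta = f^* - \Delta_{h,i} - \delta .
\]

By the weak Lipschitz property (Assumption A2), it then follows that for all $y \in \mathcal{P}_{h,i}$,

\[
\begin{aligned}
f^* - f(y) &\leq f^* - f\bigl(x^*_{h,i}(\delta)\bigr) + \max\bigl\{ f^* - f\bigl(x^*_{h,i}(\delta)\bigr), \ \ell\bigl(x^*_{h,i}(\delta), y\bigr) \bigr\} \\
&\leq \Delta_{h,i} + \delta + \max\bigl\{ \Delta_{h,i} + \delta, \ \mathrm{diam}\,\mathcal{P}_{h,i} \bigr\} .
\end{aligned}
\]

Letting $\delta \to 0$ and substituting the bounds on the suboptimality and on the diameter of $\mathcal{P}_{h,i}$
(Assumption A1) concludes the proof. ■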



4.2 Upper bound for the regret of HOO 

Auer et al. [1, Assumption 2] observed that the regret of a continuum-armed bandit algorithm should
depend on how fast the volumes of the sets of $\varepsilon$-optimal arms shrink as $\varepsilon \to 0$. Here, we capture this
by defining a new notion, the near-optimality dimension of the mean-payoff function. The connection
between these concepts, as well as with the zooming dimension defined by Kleinberg et al. [22], will
be further discussed in Section 5. We start by recalling the definition of packing numbers.

Definition 4 (Packing number) The $\varepsilon$-packing number $\mathcal{N}(\mathcal{X}, \ell, \varepsilon)$ of $\mathcal{X}$ w.r.t. the dissimilarity $\ell$
is the size of the largest packing of $\mathcal{X}$ with disjoint $\ell$-open balls of radius $\varepsilon$. That is, $\mathcal{N}(\mathcal{X}, \ell, \varepsilon)$ is
the largest integer $k$ such that there exist $k$ disjoint $\ell$-open balls with radius $\varepsilon$ contained in $\mathcal{X}$.

We now define the $c$-near-optimality dimension, which characterizes the size of the sets $\mathcal{X}_{c\varepsilon}$ as a
function of $\varepsilon$. It can be seen as some growth rate in $\varepsilon$ of the metric entropy (measured in terms of
$\ell$ and with packing numbers rather than covering numbers) of the set of $c\varepsilon$-optimal arms.

Definition 5 (Near-optimality dimension) For $c > 0$ the $c$-near-optimality dimension of $f$
w.r.t. $\ell$ equals

\[
\max\left\{ 0, \ \limsup_{\varepsilon \to 0} \frac{\ln \mathcal{N}(\mathcal{X}_{c\varepsilon}, \ell, \varepsilon)}{\ln(\varepsilon^{-1})} \right\} .
\]

The following example shows that using a dissimilarity (rather than a metric, for instance) may
sometimes allow for a significant reduction of the near-optimality dimension.

Example 3 Let $\mathcal{X} = [0,1]^D$ and let $f : [0,1]^D \to [0,1]$ be defined by $f(x) = 1 - \|x\|^a$ for some
$a \geq 1$ and some norm $\|\cdot\|$ on $\mathbb{R}^D$. Consider the dissimilarity $\ell$ defined by $\ell(x,y) = \|x - y\|^a$. We
shall see in Example 4 that $f$ is weakly Lipschitz w.r.t. $\ell$ (in a sense however slightly weaker than
the one given by (5) and (6) but sufficiently strong to ensure a result similar to the one of the main
result, Theorem 6 below). Here we claim that the $c$-near-optimality dimension (for any $c > 0$) of
$f$ w.r.t. $\ell$ is 0. On the other hand, the $c$-near-optimality dimension (for any $c > 0$) of $f$ w.r.t. the
dissimilarity $\ell'$ defined, for $0 < b < a$, by $\ell'(x,y) = \|x - y\|^b$ is $(1/b - 1/a)D > 0$. In particular,
when $a > 1$ and $b = 1$, the $c$-near-optimality dimension is $(1 - 1/a)D$.

Proof (sketch) Fix $c > 0$. The set $\mathcal{X}_{c\varepsilon}$ is the $\|\cdot\|$-ball with center 0 and radius $(c\varepsilon)^{1/a}$, that is,
the $\ell$-ball with center 0 and radius $c\varepsilon$. Its $\varepsilon$-packing number w.r.t. $\ell$ is bounded by a constant
depending only on $D$, $c$ and $a$; hence, the value 0 for the near-optimality dimension w.r.t. the
dissimilarity $\ell$.

In case of $\ell'$, we are interested in the packing number of the $\|\cdot\|$-ball with center 0 and radius
$(c\varepsilon)^{1/a}$ w.r.t. $\ell'$-balls. The latter is of the order of

\[
\left( \frac{(c\varepsilon)^{1/a}}{\varepsilon^{1/b}} \right)^{D} = c^{D/a} \, \bigl(\varepsilon^{-1}\bigr)^{(1/b - 1/a)D} ;
\]

hence, the value $(1/b - 1/a)D$ for the near-optimality dimension in the case of the dissimilarity
$\ell'$.

Note that in all these cases the $c$-near-optimality dimension of $f$ is independent of the value of
$c$. ■
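
The order-of-magnitude computation in the proof sketch can be turned into a small numerical experiment. The sketch below uses the volume ratio from the proof as a stand-in for the packing number (an assumption: true packing numbers agree with it only up to constants) and evaluates the ratio $\ln \mathcal{N} / \ln(\varepsilon^{-1})$ at a finite $\varepsilon$. Function names are hypothetical.

```python
import math

def packing_order(eps, a, b, dim, c=1.0):
    """Volume-ratio proxy ((c*eps)**(1/a) / eps**(1/b))**dim for the eps-packing
    number of X_{c*eps} w.r.t. the dissimilarity ||x - y||**b."""
    return ((c * eps) ** (1.0 / a) / eps ** (1.0 / b)) ** dim

def dimension_estimate(eps, a, b, dim, c=1.0):
    """Finite-eps proxy for the near-optimality dimension:
    max(0, ln N / ln(1/eps)) with N replaced by packing_order."""
    ratio = math.log(packing_order(eps, a, b, dim, c)) / math.log(1.0 / eps)
    return max(0.0, ratio)
```

With $a = 2$, $b = 1$ and $D = 3$ the estimate equals $(1/b - 1/a)D = 1.5$ when $c = 1$, and taking $b = a$ gives 0, in line with the two claims of Example 3. For $c \neq 1$ the estimate carries an extra term $(D/a)\ln c / \ln(\varepsilon^{-1})$ that vanishes as $\varepsilon \to 0$.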



We can now state our first main result. The proof is presented in Section A.1.

Theorem 6 (Regret bound for HOO) Consider HOO tuned with parameters such that Assumptions
A1 and A2 hold for some dissimilarity $\ell$. Let $d$ be the $4\nu_1/\nu_2$-near-optimality dimension of
the mean-payoff function $f$ w.r.t. $\ell$. Then, for all $d' > d$, there exists a constant $\gamma$ such that for all
$n \geq 1$,

\[
\mathbb{E}[R_n] \leq \gamma \, n^{(d'+1)/(d'+2)} \, (\ln n)^{1/(d'+2)} .
\]

Note that if $d$ is infinite, then the bound is vacuous. The constant $\gamma$ in the theorem depends on $d'$
and on all other parameters of HOO and of the assumptions, as well as on the bandit environment
$M$. (The value of $\gamma$ is determined in the analysis; it is in particular proportional to $\nu_2^{-d'}$.) The
next section will exhibit a refined upper bound with a more explicit value of $\gamma$ in terms of all these
parameters.
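
To get a feel for the rate in Theorem 6, it may help to instantiate the two exponents for a few values of $d'$. The display below only spells out the stated bound for particular $d'$ (with the constant $\gamma$ left unspecified and possibly depending on $d'$); it is an illustration, not an additional result.

```latex
\[
\begin{aligned}
d' = 1 &: \quad \mathbb{E}[R_n] \leq \gamma \, n^{2/3} (\ln n)^{1/3}, \\
d' = 2 &: \quad \mathbb{E}[R_n] \leq \gamma \, n^{3/4} (\ln n)^{1/4}, \\
d' \downarrow 0 &: \quad \frac{d'+1}{d'+2} \to \frac{1}{2}
  \quad \text{and} \quad \frac{1}{d'+2} \to \frac{1}{2} .
\end{aligned}
\]
```

Since the theorem requires $d' > d$, a near-optimality dimension $d = 0$ gives rates arbitrarily close to $\sqrt{n \ln n}$ in this statement, while the exponent of $n$ tends to 1 as $d'$ grows.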

Remark 7 The tuning of the parameters of HOO is critical for the assumptions to be satisfied, thus 
to achieve a good regret; given some environment, one should select the parameters of HOO such 
that the near-optimality dimension of the mean-payoff function is minimized. Since the mean-payoff 
function is unknown to the user, this might be difficult to achieve. Thus, ideally, these parameters 
should be selected adaptively based on the observation of some preliminary sample. For now, the 
investigation of this possibility is left for future work.

4.3 Improving the running time when the time horizon is known

A deficiency of the basic HOO algorithm is that its computational complexity scales quadratically
with the number of time steps. In this section we propose a simple modification to HOO that
achieves essentially the same regret as HOO and whose computational complexity scales only
log-linearly with the number of time steps. The needed amount of memory is still linear. We work
out the case when the time horizon, $n_0$, is known in advance. The case of unknown horizon can be
dealt with by resorting to the so-called doubling trick, see, e.g., [13, Section 2.3], which consists of
periodically restarting the algorithm for regimes of lengths that double at each such fresh start, so
that the $r$-th instance of the algorithm runs for $2^r$ rounds.

We consider two modifications to the algorithm described in Section 3. First, the quantities
$U_{h,i}(n)$ of (2) are redefined by replacing the factor $\ln n$ by $\ln n_0$, that is, now

\[
U_{h,i}(n) = \widehat{\mu}_{h,i}(n) + \sqrt{\frac{2 \ln n_0}{T_{h,i}(n)}} + \nu_1 \rho^h .
\]

(This results in a policy which explores the arms with a slightly increased frequency.) The definition
of the $B$-values in terms of the $U_{h,i}(n)$ is unchanged. A pleasant consequence of the above
modification is that the $B$-value of a given node changes only when this node is part of a path selected
by the algorithm. Thus at each round $n$, only the nodes along the chosen path need to be updated
according to the obtained reward.

However, and this is the reason for the second modification, in the basic algorithm, a path at
round $n$ may be of length linear in $n$ (because the tree could have a depth linear in $n$). This is why
we also truncate the trees $\mathcal{T}_n$ at a depth $D_{n_0}$ of the order of $\ln n_0$. More precisely, the algorithm now
selects the node $(H_n, I_n)$ to pull at round $n$ by following a path in the tree $\mathcal{T}_{n-1}$, starting from the
root and choosing at each node the child with the highest $B$-value (with the new definition above
using $\ln n_0$), and stopping either when it encounters a node which has not been expanded before or
a node at depth equal to

\[
D_{n_0} = \left\lceil \frac{(\ln n_0)/2 - \ln(1/\nu_1)}{\ln(1/\rho)} \right\rceil .
\]

(It is assumed that $n_0 > 1/\nu_1^2$ so that $D_{n_0} \geq 1$.) Note that since no child of a node $(D_{n_0}, i)$ located
at depth $D_{n_0}$ will ever be explored, its $B$-value at round $n \leq n_0$ simply equals $U_{D_{n_0},i}(n)$.

We call this modified version of HOO the truncated HOO algorithm. The computational complexity
of updating all $B$-values at each round $n$ is of the order of $D_{n_0}$ and thus of the order of $\ln n_0$.
The total computational complexity up to round $n_0$ is therefore of the order of $n_0 \ln n_0$, as claimed
in the introduction of this section.
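
The two ingredients of truncated HOO described above, the truncation depth and the $U$-value computed with $\ln n_0$, can be sketched as follows. This is a minimal illustration under stated assumptions rather than a full implementation: the function names are hypothetical, and the $+\infty$ value for a node that has never been played follows the convention used in the text.

```python
import math

def truncation_depth(n0, nu1, rho):
    """Depth ceil(((ln n0)/2 - ln(1/nu1)) / ln(1/rho)) at which the tree is cut."""
    if n0 * nu1 ** 2 <= 1:
        raise ValueError("need n0 > 1 / nu1**2 so that the depth is at least 1")
    return math.ceil((0.5 * math.log(n0) - math.log(1.0 / nu1)) / math.log(1.0 / rho))

def u_value(mean_reward, count, depth, n0, nu1, rho):
    """Modified U-value: empirical mean + sqrt(2 ln n0 / count) + nu1 * rho**depth.
    A node that was never played (count == 0) gets +infinity."""
    if count == 0:
        return math.inf
    return mean_reward + math.sqrt(2.0 * math.log(n0) / count) + nu1 * rho ** depth
```

For example, with $n_0 = 10^4$, $\nu_1 = 1$ and $\rho = 1/2$ the path followed at each round has length at most 7, so each round touches only a handful of nodes. Because the confidence term uses the fixed $\ln n_0$, a node's $U$-value only changes when its own count or mean changes, which is what makes the per-round update local to the chosen path.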

As the next theorem indicates, this new procedure enjoys almost the same cumulative regret
bound as the basic HOO algorithm.

Theorem 8 (Upper bound on the regret of truncated HOO) Fix a horizon $n_0$ such that $D_{n_0} \geq 1$.
Then, the regret bound of Theorem 6 still holds true at round $n_0$ for truncated HOO up to an
additional additive $4\sqrt{n_0}$ factor.



4.4 Local assumptions 

In this section we further relax the weak Lipschitz assumption and require it only to hold locally
around the maxima. Doing so, we will be able to deal with an even larger class of functions and
in fact we will show that the algorithm studied in this section achieves a $\widetilde{O}(\sqrt{n})$ bound on the
regret when it is used for functions that are smooth around their maxima (e.g., equivalent to
$\|x - x^*\|^\alpha$ for some known smoothness degree $\alpha > 0$).

For the sake of simplicity and to derive exact constants we also state in a more explicit way the
assumption on the near-optimality dimension. We then propose a simple and efficient adaptation of
the HOO algorithm suited for this context.



4.4.1 Modified set of assumptions 

Assumptions Given the parameters of (the adaptation of) HOO, that is, the real numbers $\nu_1 > 0$
and $\rho \in (0,1)$ and the tree of coverings $(\mathcal{P}_{h,i})$, there exists a dissimilarity function $\ell$ such that
Assumption A1 (for some $\nu_2 > 0$) as well as the following two assumptions hold.

A2'. There exists $\varepsilon_0 > 0$ such that for all optimal subsets $A \subset \mathcal{X}$ (i.e., $\sup_{x \in A} f(x) = f^*$) with
diameter $\mathrm{diam}(A) \leq \varepsilon_0$,

\[
f^* - \inf_{x \in A} f(x) \leq \mathrm{diam}(A) .
\]
Further, there exists $L > 0$ such that for all $x \in \mathcal{X}_{\varepsilon_0}$ and $\varepsilon \in [0, \varepsilon_0]$,

\[
\mathcal{B}\bigl(x, \, f^* - f(x) + \varepsilon\bigr) \subset \mathcal{X}_{L(2(f^* - f(x)) + \varepsilon)} .
\]

A3. There exist $C > 0$ and $d \geq 0$ such that for all $\varepsilon \leq \varepsilon_0$,

\[
\mathcal{N}\bigl(\mathcal{X}_{c\varepsilon}, \ell, \varepsilon\bigr) \leq C \varepsilon^{-d} ,
\]

where $c = 4L\nu_1/\nu_2$.

When $f$ satisfies Assumption A2', we say that $f$ is $\varepsilon_0$-locally $L$-weakly Lipschitz w.r.t. $\ell$. Note
that this assumption was obtained by weakening the characterizations (5) and (6) of weak Lipschitzness.

Assumption A3 is not a real assumption but merely a reformulation of the definition of near
optimality (with the small added ingredient that the limit can be achieved, see the second step of
the proof of Theorem 6 in Section A.1).

Example 4 We consider again the domain $\mathcal{X}$ and function $f$ studied in Example 3 and prove (as
announced beforehand) that $f$ is $\varepsilon_0$-locally $2^{a-1}$-weakly Lipschitz w.r.t. the dissimilarity $\ell$ defined
by $\ell(x,y) = \|x - y\|^a$; which, in fact, holds for all $\varepsilon_0$.

Proof Note that $x^* = (0, \ldots, 0)$ is such that $f^* = 1 = f(x^*)$. Therefore, for all $x \in \mathcal{X}$,

\[
f^* - f(x) = \|x\|^a = \ell(x, x^*) ,
\]

which yields the first part of Assumption A2'. To prove that the second part is true for $L = 2^{a-1}$
and with no constraint on the considered $\varepsilon$, we first note that since $a \geq 1$, it holds by convexity
that $(u+v)^a \leq 2^{a-1}(u^a + v^a)$ for all $u, v \geq 0$. Now, for all $\varepsilon \geq 0$ and $y \in \mathcal{B}(x, \|x\|^a + \varepsilon)$, i.e., $y$
such that $\ell(x,y) = \|x - y\|^a < \|x\|^a + \varepsilon$,

\[
f^* - f(y) = \|y\|^a \leq \bigl( \|x\| + \|x - y\| \bigr)^a \leq 2^{a-1} \bigl( \|x\|^a + \|x - y\|^a \bigr) < 2^{a-1} \bigl( 2\|x\|^a + \varepsilon \bigr) ,
\]

which concludes the proof of the second part of A2'. ■

4.4.2 Modified HOO algorithm



We now describe the proposed modifications to the basic HOO algorithm.

We first consider, as a building block, the algorithm called $z$-HOO, which takes an integer $z$ as
an additional parameter to those of HOO. Algorithm $z$-HOO works as follows: it never plays any
node with depth smaller or equal to $z - 1$ and starts directly the selection of a new node at depth $z$.
To do so, it first picks the node at depth $z$ with the best $B$-value, chooses a path and then proceeds
as the basic HOO algorithm. Note in particular that the initialization of this algorithm consists (in
the first $2^z$ rounds) in playing once each of the $2^z$ nodes located at depth $z$ in the tree (since by
definition a node that has not been played yet has a $B$-value equal to $+\infty$). We note in passing
that when $z = 0$, algorithm $z$-HOO coincides with the basic HOO algorithm.

Algorithm local-HOO employs the doubling trick in conjunction with consecutive instances of
$z$-HOO. It works as follows. The integers $r \geq 1$ will index different regimes. The $r$-th regime starts
at round $2^r - 1$ and ends when the next regime starts; it thus lasts for $2^r$ rounds. At the beginning
of regime $r$, a fresh copy of $z_r$-HOO, where $z_r = \lceil \log_2 r \rceil$, is initialized and is then used throughout
the regime.

Note that each fresh start needs to pull each of the $2^{z_r}$ nodes located at depth $z_r$ at least once (the
number of these nodes is $\approx r$). However, since regime $r$ lasts for $2^r$ time steps (which is exponentially
larger than the number of nodes to explore), the time spent on the initialization of $z_r$-HOO in any
regime $r$ is greatly outnumbered by the time spent in the rest of the regime.
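
The regime structure of local-HOO can be tabulated directly from the description above. The following sketch lists, for each regime $r$, its first and last round and the depth $z_r = \lceil \log_2 r \rceil$ at which the fresh copy of $z_r$-HOO starts; the function name is hypothetical and only the schedule (not the bandit algorithm itself) is computed.

```python
def local_hoo_regimes(num_regimes):
    """Return a list of (r, first_round, last_round, z_r) for r = 1..num_regimes."""
    schedule = []
    for r in range(1, num_regimes + 1):
        first_round = 2 ** r - 1
        last_round = 2 ** (r + 1) - 2   # the round just before regime r + 1 starts
        z_r = (r - 1).bit_length()      # equals ceil(log2(r)) for integers r >= 1
        schedule.append((r, first_round, last_round, z_r))
    return schedule
```

In every regime the $2^{z_r} < 2r$ initial pulls are few compared with the $2^r$ rounds available, which is the point made in the last paragraph.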

In the rest of this section, we propose first an upper bound on the regret of z-HOO (with exact 
and explicit constants). This result will play a key role in proving a bound on the performance of 
local-HOO. 

4.4.3 Adaptation of the regret bound 

In the following we write $h_0$ for the smallest integer such that

\[
2 \nu_1 \rho^{h_0} < \varepsilon_0
\]

and consider the algorithm $z$-HOO, where $z \geq h_0$. In particular, when $z = 0$ is chosen, the obtained
bound is the same as the one of Theorem 6, except that the constants are given in analytic forms.

Theorem 9 (Regret bound for $z$-HOO) Consider $z$-HOO tuned with parameters $\nu_1$ and $\rho$ such
that Assumptions A1, A2' and A3 hold for some dissimilarity $\ell$ and the values $\nu_2$, $L$, $\varepsilon_0$, $C$, $d$. If,
in addition, $z \geq h_0$ and $n \geq 2$ is large enough so that

\[
z \leq \frac{1}{d+2} \, \frac{\ln(4 L \nu_1 n) - \ln(\gamma \ln n)}{\ln(1/\rho)} ,
\]

where

\[
\gamma = \frac{4 \, C L \nu_1 \nu_2^{-d}}{(1/\rho)^{d+1} - 1} \left( \frac{16}{\nu_1^2 \rho^2} + 9 \right) ,
\]

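then the following bound holds for the expected regret of $z$-HOO:

\[
\mathbb{E}[R_n] \leq \left( 1 + \frac{1}{\rho^{d+2}} \right) (4 L \nu_1 n)^{(d+1)/(d+2)} (\gamma \ln n)^{1/(d+2)} + \bigl( 2^z - 1 \bigr) \left( \frac{8 \ln n}{\nu_1^2 \rho^{2z}} + 4 \right) .
\]
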
The proof, which is a modification of the proof of Theorem 6, can be found in Section A.3 of
the Appendix. The main complication arises because the weakened assumptions do not allow one to
reason about the smoothness at an arbitrary scale; this is essentially due to the threshold $\varepsilon_0$ used in
the formulation of the assumptions. This is why in the proposed variant of HOO we discard nodes
located too close to the root (at depth smaller than $h_0 - 1$). Note that in the bound the second term
arises from playing in regions corresponding to the descendants of "poor" nodes located at level $z$.
In particular, this term disappears when $z = 0$, in which case we get a bound on the regret of HOO
provided that $2\nu_1 < \varepsilon_0$ holds.

Example 5 We consider again the setting of Examples 2, 3, and 4. The domain is $\mathcal{X} = [0,1]^D$
and the mean-payoff function $f$ is defined by $f(x) = 1 - \|x\|_\infty^2$. We assume that HOO is run with
parameters $\rho = (1/4)^{1/D}$ and $\nu_1 = 4$. We already proved that Assumptions A1, A2' and A3 are
satisfied with the dissimilarity $\ell(x,y) = \|x - y\|_\infty^2$, the constants $\nu_2 = 1/4$, $L = 2$, $d = 0$, and
$C = 128^{D/2}$ (see the footnote on the computation of $C$), as well as any $\varepsilon_0 > 0$ (that is, with $h_0 = 0$). Thus, resorting to Theorem 9 (applied
with $z = 0$), we obtain

\[
\gamma = \frac{32 \times 128^{D/2}}{4^{1/D} - 1} \left( 4^{2/D} + 9 \right)
\]

and get

\[
\mathbb{E}[R_n] \leq \bigl( 1 + 4^{2/D} \bigr) \sqrt{32 \, \gamma \, n \ln n} = \sqrt{\exp(O(D)) \, n \ln n} .
\]

Under the prescribed assumptions, the rate of convergence is of order $\sqrt{n}$ no matter the ambient
dimension $D$. Although the rate is independent of $D$, the latter impacts the performance through
the multiplicative factor in front of the rate, which is exponential in $D$. This is, however, not an
artifact of our analysis, since it is natural that exploration in a $D$-dimensional space comes at a cost
exponential in $D$. (The exploration performed by HOO naturally combines an initial global search,
which is bound to be exponential in $D$, and a local optimization, whose regret is of the order of $\sqrt{n}$.)
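
The dependence on $D$ of the constants in Example 5 can be made tangible by evaluating the two displayed expressions. The sketch below simply codes the formulas for $\gamma$ and for the regret bound as quoted in the example; the function names are hypothetical, and the requirement of Theorem 9 that $n$ be large enough is not checked here.

```python
import math

def example5_gamma(dim):
    """gamma = 32 * 128**(D/2) / (4**(1/D) - 1) * (4**(2/D) + 9), as in Example 5."""
    return (32.0 * 128.0 ** (dim / 2.0) / (4.0 ** (1.0 / dim) - 1.0)
            * (4.0 ** (2.0 / dim) + 9.0))

def example5_regret_bound(n, dim):
    """(1 + 4**(2/D)) * sqrt(32 * gamma * n * ln n), the bound quoted in Example 5."""
    return (1.0 + 4.0 ** (2.0 / dim)) * math.sqrt(
        32.0 * example5_gamma(dim) * n * math.log(n)
    )
```

For $D = 2$ the constant is $\gamma = 32 \times 128 \times 13 = 53248$, and it grows roughly geometrically with $D$, whereas quadrupling $n$ only multiplies the bound by a factor slightly above 2, reflecting the $\sqrt{n \ln n}$ rate.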

The following theorem is an almost straightforward consequence of Theorem 9 (the detailed proof
can be found in Section A.4 of the Appendix). Note that local-HOO does not require the knowledge
of the parameter $\varepsilon_0$ in A2'.

Theorem 10 (Regret bound for local-HOO) Consider local-HOO and assume that its parameters
are tuned such that Assumptions A1, A2' and A3 hold for some dissimilarity $\ell$. Then the
expected regret of local-HOO is bounded (in a distribution-dependent sense) as follows,

\[
\mathbb{E}[R_n] = \widetilde{O}\bigl( n^{(d+1)/(d+2)} \bigr) .
\]



4.5 Minimax optimality in metric spaces 

In this section we provide two theorems showing the minimax optimality of HOO in metric spaces. 
The notion of packing dimension is key. 

Definition 11 (Packing dimension) The $\ell$-packing dimension of a set $\mathcal{X}$ (w.r.t. a dissimilarity
$\ell$) is defined as

\[
\limsup_{\varepsilon \to 0} \frac{\ln \mathcal{N}(\mathcal{X}, \ell, \varepsilon)}{\ln(\varepsilon^{-1})} .
\]

For instance, it is easy to see that whenever $\ell$ is a norm, compact subsets of $\mathbb{R}^D$ with non-empty
interiors have a packing dimension of $D$. We note in passing that the packing dimension provides a
bound on the near-optimality dimension that only depends on $\mathcal{X}$ and $\ell$ but not on the underlying
mean-payoff function.

Let $\mathcal{F}_{\mathcal{X},\ell}$ be the class of all bandit environments on $\mathcal{X}$ with a weak Lipschitz mean-payoff function
(i.e., satisfying Assumption A2). For the sake of clarity, we now denote, for a bandit strategy $\varphi$ and
a bandit environment $M$ on $\mathcal{X}$, the expectation of the cumulative regret of $\varphi$ over $M$ at time $n$ by
$\mathbb{E}_M[R_n(\varphi)]$.

Footnote (on the constant $C$ of Example 5). To compute $C$, one can first note that $4L\nu_1/\nu_2 = 128$; the question at hand for Assumption A3 to be satisfied is
therefore to upper bound the number of balls of radius $\sqrt{\varepsilon}$ (w.r.t. the supremum norm $\|\cdot\|_\infty$) that can be packed in
a ball of radius $\sqrt{128\varepsilon}$, giving rise to the bound $C \leq \sqrt{128}^{\,D}$.

The following theorem provides a uniform upper bound on the regret of HOO over this class of
environments. It is a corollary of Theorem 9; most of the efforts in the proof consist of showing that
the distribution-dependent constant $\gamma$ in the statement of Theorem 9 can be upper bounded by a
quantity (the $\gamma$ in the statement below) that only depends on $\mathcal{X}$, $\nu_1$, $\rho$, $\ell$, $\nu_2$, $D'$, but not on the
underlying mean-payoff functions. The proof is provided in Section A.5 of the Appendix.

Theorem 12 (Uniform upper bound on the regret of HOO) Assume that $\mathcal{X}$ has a finite
$\ell$-packing dimension $D$ and that the parameters of HOO are such that A1 is satisfied. Then, for all
$D' > D$ there exists a constant $\gamma$ such that for all $n \geq 1$,

\[
\sup_{M \in \mathcal{F}_{\mathcal{X},\ell}} \mathbb{E}_M\bigl[ R_n(\mathrm{HOO}) \bigr] \leq \gamma \, n^{(D'+1)/(D'+2)} .
\]

The next result shows that in the case of metric spaces this upper bound is optimal up to a
multiplicative logarithmic factor. Similar lower bounds appeared in [21] (for $D = 1$) and in [22]. We
propose here a weaker statement that suits our needs. Note that if $\mathcal{X}$ is a large enough compact subset
of $\mathbb{R}^D$ with non-empty interior and the dissimilarity $\ell$ is some norm of $\mathbb{R}^D$, then the assumption of
the following theorem is satisfied.

Theorem 13 (Uniform lower bound) Consider a set $\mathcal{X}$ equipped with a dissimilarity $\ell$ that is a
metric. Assume that there exists some constant $c \in (0,1]$ such that for all $\varepsilon \leq 1$, the packing numbers
satisfy $\mathcal{N}(\mathcal{X}, \ell, \varepsilon) \geq c\,\varepsilon^{-D} \geq 2$. Then, there exist two constants $N(c,D)$ and $\gamma(c,D)$ depending only
on $c$ and $D$ such that for all bandit strategies $\varphi$ and all $n \geq N(c,D)$,

\[
\sup_{M \in \mathcal{F}_{\mathcal{X},\ell}} \mathbb{E}_M\bigl[ R_n(\varphi) \bigr] \geq \gamma(c,D) \, n^{(D+1)/(D+2)} .
\]

The reader interested in the explicit expressions of $N(c,D)$ and $\gamma(c,D)$ is referred to the last
lines of the proof of the theorem in the Appendix.

5 Discussion 

In this section we would like to shed some light on the results of the previous sections. In particular
we generalize the situation of Example 5, discuss the regret that we can obtain, and compare it with
what could be obtained by previous works.

5.1 Examples of regret bounds for functions locally smooth at their maxima

We equip $\mathcal{X} = [0,1]^D$ with a norm $\|\cdot\|$. We assume that the mean-payoff function $f$ has a finite
number of global maxima and that it is locally equivalent to the function $\|x - x^*\|^\alpha$, with degree
$\alpha \in [0, \infty)$, around each such global maximum $x^*$ of $f$; that is,

\[
f(x^*) - f(x) = \Theta\bigl( \|x - x^*\|^\alpha \bigr) \qquad \text{as } x \to x^* .
\]

This means that there exist $c_1, c_2, \delta > 0$ such that for all $x$ satisfying $\|x - x^*\| \leq \delta$,

\[
c_2 \|x - x^*\|^\alpha \leq f(x^*) - f(x) \leq c_1 \|x - x^*\|^\alpha .
\]

In particular, one can check that Assumption A2' is satisfied for the dissimilarity defined by $\ell_{c,\beta}(x,y) =
c\|x - y\|^\beta$, where $\beta \leq \alpha$ (and $c \geq c_1$ when $\beta = \alpha$). We further assume that HOO is run with
parameters $\nu_1$ and $\rho$ and a tree of dyadic partitions such that Assumption A1 is satisfied as well (see
Examples 1 and 2 for explicit values of these parameters in the case of the Euclidean or the
supremum norms over the unit cube). The following statements can then be formulated on the expected
regret of HOO.

• Known smoothness: If we know the true smoothness of $f$ around its maxima, then we set
$\beta = \alpha$ and $c \geq c_1$. This choice $\ell_{c,\alpha}$ of a dissimilarity is such that $f$ is locally weak-Lipschitz
with respect to it and the near-optimality dimension is $d = 0$ (cf. Example 3). Theorem 10
thus implies that the expected regret of local-HOO is $\widetilde{O}(\sqrt{n})$, i.e., the rate of the bound is
independent of the dimension $D$.

• Smoothness underestimated: Here, we assume that the true smoothness of $f$ around its
maxima is unknown and that it is underestimated by choosing $\beta < \alpha$ (and some $c$). Then
$f$ is still locally weak-Lipschitz with respect to the dissimilarity $\ell_{c,\beta}$ and the near-optimality
dimension is $d = D(1/\beta - 1/\alpha)$, as shown in Example 3; the regret of HOO is $\widetilde{O}\bigl(n^{(d+1)/(d+2)}\bigr)$.

• Smoothness overestimated: Now, if the true smoothness is overestimated by choosing
$\beta > \alpha$ or $\alpha = \beta$ and $c < c_1$, then the assumption of weak Lipschitzness is violated and we
are unable to provide any guarantee on the behavior of HOO. The latter, when used with
an overestimated smoothness parameter, may lack exploration and exploit too heavily from
the beginning. As a consequence, it may get stuck in some local optimum of $f$, missing the
global one(s) for a very long time (possibly indefinitely). Such a behavior is illustrated in the
example provided in [13], which shows the possible problematic behavior of the closely related
algorithm UCT of [24]. UCT is an example of an algorithm overestimating the smoothness of
the function; this is because the $B$-values of UCT are defined similarly to the ones of the HOO
algorithm but without the third term in the definition (2) of the $U$-values. This corresponds to
an assumed infinite degree of smoothness (that is, to a locally constant mean-payoff function).
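
The three cases above can be summarised by computing the near-optimality dimension and the resulting exponent of $n$ from the true smoothness $\alpha$, the assumed smoothness $\beta$ and the dimension $D$. The helper names below are hypothetical; logarithmic factors and constants are ignored, and the overestimated case (which also includes $\beta = \alpha$ with $c < c_1$, not modelled here) is rejected because no guarantee is claimed for it.

```python
def near_optimality_dimension(alpha, beta, dim):
    """d = D * (1/beta - 1/alpha) for an assumed smoothness beta <= alpha."""
    if beta > alpha:
        raise ValueError("beta > alpha overestimates the smoothness: no guarantee")
    return dim * (1.0 / beta - 1.0 / alpha)

def regret_exponent(d):
    """Exponent (d + 1) / (d + 2) of n in the regret rate, up to log factors."""
    return (d + 1.0) / (d + 2.0)
```

For $\alpha = \beta = 2$ the dimension is 0 and the exponent is $1/2$ whatever $D$ is, whereas underestimating with $\beta = 1$ in dimension $D = 4$ gives $d = 2$ and the slower exponent $3/4$.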



5.2 Relation to previous works 

Several works (see, e.g., [21]; [12]; [22]) have considered continuum-armed bandits in Euclidean or, more
generally, normed or metric spaces and provided upper and lower bounds on the regret for given
classes of environments.

• Cope [12] derived a $\widetilde{O}(\sqrt{n})$ bound on the regret for compact and convex subsets of $\mathbb{R}^D$ and
mean-payoff functions with a unique minimum and second-order smoothness.

• Kleinberg [21] considered mean-payoff functions $f$ on the real line that are Hölder continuous
with degree $0 < \alpha \leq 1$. The derived regret bound is $\Theta\bigl(n^{(\alpha+1)/(2\alpha+1)}\bigr)$.

• Auer et al. extended the analysis to classes of functions that are equivalent to $\|x - x^*\|^\alpha$
around their maxima $x^*$, where the allowed smoothness degree is also larger: $\alpha \in [0, \infty)$. They
derived the regret bound

\[
\Theta\bigl( n^{(1 + \alpha - \alpha\beta)/(1 + 2\alpha - \alpha\beta)} \bigr) ,
\]

where the parameter $\beta$ is such that the Lebesgue measure of $\varepsilon$-optimal arms is $O(\varepsilon^\beta)$.



Another setting is the one of [22] and [23], who considered a space $(\mathcal{X}, \ell)$ equipped with some
dissimilarity $\ell$ and assumed that $f$ is Lipschitz w.r.t. $\ell$ at some maximum $x^*$ (when the latter
exists and a relaxed condition otherwise), that is,

\[
\forall x \in \mathcal{X}, \qquad f(x^*) - f(x) \leq \ell(x, x^*) . \tag{7}
\]

The obtained regret bound is $\widetilde{O}\bigl(n^{(d+1)/(d+2)}\bigr)$, where $d$ is the zooming dimension. The latter
is defined similarly to our near-optimality dimension with the exceptions that in the definition
of zooming dimension (i) covering numbers instead of packing numbers are used and (ii) sets
of the form $\mathcal{X}_\varepsilon \setminus \mathcal{X}_{\varepsilon/2}$ are considered instead of the set $\mathcal{X}_{c\varepsilon}$. When $(\mathcal{X}, \ell)$ is a metric space,
covering and packing numbers are within a constant factor of each other, and therefore, one
may prove that the zooming and near-optimality dimensions are also equal.

For an illustration, consider again the example of Section 5.1. The result of Auer et al. shows
that for $D = 1$, the regret is $\Theta(\sqrt{n})$ (since here $\beta = 1/\alpha$, with the notation above). Our result
extends the $\sqrt{n}$ rate of the regret bound to any dimension $D$.

On the other hand the analysis of Kleinberg et al. does not apply because in this example
$f(x^*) - f(x)$ is controlled only when $x$ is close in some sense to $x^*$ (i.e., when $\|x - x^*\| \leq \delta$),
while (7) requires such a control over the whole set $\mathcal{X}$. However, note that the local weak-Lipschitz
assumption A2' requires an extra condition in the vicinity of $x^*$ compared to (7) as it is based on
the notion of weak Lipschitzness. Thus, A2' and (7) are in general incomparable (both capture a
different phenomenon at the maxima).

We now compare our results to those of [22] and [23] under Assumption A2 (which does not cover
the example of Section 5.1 unless $\delta$ is large). Under this assumption, our algorithms enjoy essentially
the same theoretical guarantees as the zooming algorithm of [22]; [23]. Further, the following hold.

• Our algorithms do not require the oracle needed by the zooming algorithm.

• Our truncated HOO algorithm achieves a computational complexity of order $O(n \log n)$, whereas
the complexity of a naive implementation of the zooming algorithm is likely to be much larger.

• Both truncated HOO and the zooming algorithms use the doubling trick. The basic HOO
algorithm, however, avoids the doubling trick, while meeting the computational complexity of
the zooming algorithm.

The fact that the doubling trick can be avoided is good news since an algorithm that uses the
doubling trick must start from tabula rasa from time to time, which results in predictable, yet inevitable,
sharp performance drops, a quite unpleasant property. In particular, for this reason algorithms
that rely on the doubling trick are often neglected by practitioners. In addition, the fact that we
avoid the oracle needed by the zooming algorithm is attractive as this oracle might be difficult to
implement for general (non-metric) dissimilarities.

A Proofs

A.1 Proof of Theorem 6 (main upper bound on the regret of HOO)

We begin with three lemmas. The proofs of Lemmas 15 and 16 rely on concentration-of-measure
techniques, while the one of Lemma 14 follows from a simple case study. Let us fix some path $(0,1)$,
$(1, i^*_1)$, $(2, i^*_2)$, ... of optimal nodes, starting from the root. That is, denoting $i^*_0 = 1$, we mean that
for all $j \geq 1$, the suboptimality of $(j, i^*_j)$ equals $\Delta_{j, i^*_j} = 0$ and $(j, i^*_j)$ is a child of $(j-1, i^*_{j-1})$.

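Lemma 14 Let $(h,i)$ be a suboptimal node. Let $0 \leq k \leq h-1$ be the largest depth such that $(k, i^*_k)$
is on the path from the root $(0,1)$ to $(h,i)$. Then for all integers $u \geq 0$, we have

\[
\mathbb{E}\bigl[ T_{h,i}(n) \bigr] \leq u + \sum_{t=u+1}^{n} \mathbb{P}\Bigl\{ \bigl[ U_{s, i^*_s}(t) \leq f^* \text{ for some } s \in \{k+1, \ldots, t-1\} \bigr] \ \text{or} \ \bigl[ T_{h,i}(t) > u \text{ and } U_{h,i}(t) > f^* \bigr] \Bigr\} .
\]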



Proof Consider a given round $t \in \{1, \ldots, n\}$. If $(H_t, I_t) \in \mathcal{C}(h,i)$, then this is because the child
$(k+1, i')$ of $(k, i^*_k)$ on the path to $(h,i)$ had a better $B$-value than its brother $(k+1, i^*_{k+1})$. Since by
definition, $B$-values can only increase on a chosen path, this entails that $B_{k+1, i^*_{k+1}}(t) \leq B_{k+1, i'}(t) \leq
B_{h,i}(t)$. This in turn implies, again by definition of the $B$-values, that $B_{k+1, i^*_{k+1}}(t) \leq U_{h,i}(t)$. Thus,

\[
\bigl\{ (H_t, I_t) \in \mathcal{C}(h,i) \bigr\} \subset \bigl\{ U_{h,i}(t) \geq B_{k+1, i^*_{k+1}}(t) \bigr\} \subset \bigl\{ U_{h,i}(t) > f^* \bigr\} \cup \bigl\{ B_{k+1, i^*_{k+1}}(t) \leq f^* \bigr\} .
\]

But, once again by definition of $B$-values,

\[
\bigl\{ B_{k+1, i^*_{k+1}}(t) \leq f^* \bigr\} \subset \bigl\{ U_{k+1, i^*_{k+1}}(t) \leq f^* \bigr\} \cup \bigl\{ B_{k+2, i^*_{k+2}}(t) \leq f^* \bigr\} ,
\]

and the argument can be iterated. Since up to round $t$ no more than $t$ nodes have been played
(including the suboptimal node $(h,i)$), we know that $(t, i^*_t)$ has not been played so far and thus
has a $B$-value equal to $+\infty$. (Some of the previous optimal nodes could also have had an infinite
$U$-value, if not played so far.) We thus have proved the inclusion

\[
\bigl\{ (H_t, I_t) \in \mathcal{C}(h,i) \bigr\} \subset \bigl\{ U_{h,i}(t) > f^* \bigr\} \cup \Bigl( \bigl\{ U_{k+1, i^*_{k+1}}(t) \leq f^* \bigr\} \cup \ldots \cup \bigl\{ U_{t-1, i^*_{t-1}}(t) \leq f^* \bigr\} \Bigr) . \tag{8}
\]

Now, for any integer $u \geq 0$ it holds that

\[
\begin{aligned}
T_{h,i}(n) &= \sum_{t=1}^{n} \mathbb{I}_{\{ (H_t, I_t) \in \mathcal{C}(h,i), \ T_{h,i}(t) \leq u \}} + \sum_{t=1}^{n} \mathbb{I}_{\{ (H_t, I_t) \in \mathcal{C}(h,i), \ T_{h,i}(t) > u \}} \\
&\leq u + \sum_{t=u+1}^{n} \mathbb{I}_{\{ (H_t, I_t) \in \mathcal{C}(h,i), \ T_{h,i}(t) > u \}} ,
\end{aligned}
\]

where we used for the inequality the fact that the quantities $T_{h,i}(t)$ are constant from $t$ to $t+1$,
except when $(H_t, I_t) \in \mathcal{C}(h,i)$, in which case, they increase by 1; therefore, on the one hand, at most
$u$ of the $T_{h,i}(t)$ can be smaller than $u$ and on the other hand, $T_{h,i}(t) > u$ can only happen if $t > u$.
Using (8) and then taking expectations yields the result. ■
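
The counting step at the end of this proof is purely combinatorial and can be checked on arbitrary play sequences. In the sketch below, `played[t-1]` indicates whether round $t$ selected a node in $\mathcal{C}(h,i)$, and $T_{h,i}(t)$ is the running count including round $t$; the function name and the random test sequences are assumptions of this illustration.

```python
import random

def counting_decomposition(played, u):
    """Return (T_n, bound), where T_n is the total number of plays and
    bound = u + #{t > u : round t is a play and the running count exceeds u}."""
    count = 0
    late_plays = 0
    for t, hit in enumerate(played, start=1):
        if hit:
            count += 1
            if t > u and count > u:
                late_plays += 1
    return count, u + late_plays

def holds_on_random_sequences(trials=200, horizon=50, seed=0):
    """Check T_n <= bound on random 0/1 sequences and random thresholds u."""
    rng = random.Random(seed)
    for _ in range(trials):
        played = [rng.random() < 0.3 for _ in range(horizon)]
        u = rng.randrange(0, horizon)
        total, bound = counting_decomposition(played, u)
        if total > bound:
            return False
    return True
```

The inequality holds deterministically: plays made while the count is at most $u$ number at most $u$, and any play with count above $u$ necessarily occurs at a round $t > u$.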



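Lemma 15 Let Assumptions A1 and A2 hold. Then, for all optimal nodes $(h,i)$ and for all integers
$n \geq 1$,

\[
\mathbb{P}\bigl\{ U_{h,i}(n) \leq f^* \bigr\} \leq n^{-3} .
\]

Proof On the event that $(h,i)$ was not played during the first $n$ rounds, one has, by convention,
$U_{h,i}(n) = +\infty$. In the sequel, we therefore restrict our attention to the event $\{T_{h,i}(n) \geq 1\}$.
Lemma 3 with $c = 0$ ensures that $f^* - f(x) \leq \nu_1\rho^h$ for all arms $x \in \mathcal{P}_{h,i}$. Hence,

\[
\sum_{t=1}^{n} \bigl( f(X_t) + \nu_1\rho^h - f^* \bigr) \mathbb{I}_{\{ (H_t, I_t) \in \mathcal{C}(h,i) \}} \geq 0
\]

and therefore,

\[
\begin{aligned}
& \mathbb{P}\bigl\{ U_{h,i}(n) \leq f^* \text{ and } T_{h,i}(n) \geq 1 \bigr\} \\
&= \mathbb{P}\Bigl\{ \widehat{\mu}_{h,i}(n) + \sqrt{\tfrac{2 \ln n}{T_{h,i}(n)}} + \nu_1\rho^h \leq f^* \text{ and } T_{h,i}(n) \geq 1 \Bigr\} \\
&= \mathbb{P}\Bigl\{ T_{h,i}(n)\, \widehat{\mu}_{h,i}(n) + T_{h,i}(n) \bigl( \nu_1\rho^h - f^* \bigr) \leq - \sqrt{2\, T_{h,i}(n) \ln n} \text{ and } T_{h,i}(n) \geq 1 \Bigr\} \\
&= \mathbb{P}\Bigl\{ \sum_{t=1}^{n} \bigl( Y_t - f(X_t) \bigr) \mathbb{I}_{\{ (H_t, I_t) \in \mathcal{C}(h,i) \}} + \sum_{t=1}^{n} \bigl( f(X_t) + \nu_1\rho^h - f^* \bigr) \mathbb{I}_{\{ (H_t, I_t) \in \mathcal{C}(h,i) \}} \\
&\qquad\qquad \leq - \sqrt{2\, T_{h,i}(n) \ln n} \text{ and } T_{h,i}(n) \geq 1 \Bigr\} \\
&\leq \mathbb{P}\Bigl\{ \sum_{t=1}^{n} \bigl( f(X_t) - Y_t \bigr) \mathbb{I}_{\{ (H_t, I_t) \in \mathcal{C}(h,i) \}} \geq \sqrt{2\, T_{h,i}(n) \ln n} \text{ and } T_{h,i}(n) \geq 1 \Bigr\} .
\end{aligned}
\]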

We take care of the last term with a union bound and the Hoeffding-Azuma inequality for martingale 
differences. 

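To do this in a rigorous manner, we need to define a sequence of (random) stopping times when arms in $\mathcal{C}(h,i)$ were pulled:

$$T_j=\min\{t : T_{h,i}(t)=j\},\qquad j=1,2,\dots$$

Note that $1\le T_1<T_2<\dots$, hence it holds that $T_j\ge j$. We denote by $\widetilde{X}_j=X_{T_j}$ the $j$-th arm pulled in the region corresponding to $\mathcal{C}(h,i)$. Its associated reward equals $\widetilde{Y}_j=Y_{T_j}$ and

$$\begin{aligned}
&\mathbb{P}\Big\{\sum_{t=1}^{n}\big(f(X_t)-Y_t\big)\mathbb{I}_{\{(H_t,I_t)\in\mathcal{C}(h,i)\}}\ge\sqrt{2\,T_{h,i}(n)\ln n} \text{ and } T_{h,i}(n)\ge 1\Big\} \\
&\quad= \mathbb{P}\Big\{\sum_{j=1}^{T_{h,i}(n)}\big(f(\widetilde{X}_j)-\widetilde{Y}_j\big)\ge\sqrt{2\,T_{h,i}(n)\ln n} \text{ and } T_{h,i}(n)\ge 1\Big\} \\
&\quad\le \sum_{t=1}^{n}\mathbb{P}\Big\{\sum_{j=1}^{t}\big(f(\widetilde{X}_j)-\widetilde{Y}_j\big)\ge\sqrt{2\,t\ln n}\Big\},
\end{aligned}$$

where we used a union bound to get the last inequality.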
We claim that

$$Z_t=\sum_{j=1}^{t}\big(f(\widetilde{X}_j)-\widetilde{Y}_j\big)$$

is a martingale w.r.t. the filtration $\mathcal{G}_t=\sigma\big(\widetilde{X}_1,Z_1,\dots,\widetilde{X}_t,Z_t,\widetilde{X}_{t+1}\big)$. This follows, via optional skipping (see [13, Chapter VII, adaptation of Theorem 2.3]), from the facts that

$$\sum_{s=1}^{t}\big(f(X_s)-Y_s\big)\,\mathbb{I}_{\{(H_s,I_s)\in\mathcal{C}(h,i)\}}$$

is a martingale w.r.t. the filtration $\mathcal{F}_t=\sigma(X_1,Y_1,\dots,X_t,Y_t,X_{t+1})$ and that the events $\{T_j=k\}$ are $\mathcal{F}_{k-1}$-measurable for all $k\ge j$.

Applying the Hoeffding-Azuma inequality for martingale differences (see [20]), using the boundedness of the ranges of the induced martingale difference sequence, we then get, for each $t\ge 1$,

$$\mathbb{P}\Big\{\sum_{j=1}^{t}\big(f(\widetilde{X}_j)-\widetilde{Y}_j\big)\ge\sqrt{2\,t\ln n}\Big\}\le\exp\Big(-\frac{2\big(\sqrt{2\,t\ln n}\big)^2}{t}\Big)=n^{-4},$$

which concludes the proof.
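As an illustration only (not part of the original argument), the arithmetic behind the last step can be checked numerically: with a deviation of $\sqrt{2t\ln n}$ over $t$ increments of range at most 1, the Hoeffding-Azuma bound equals $n^{-4}$ whatever $t$ is, and the union bound over $t=1,\dots,n$ gives $n^{-3}$. The function names below are hypothetical.

```python
import math


def hoeffding_azuma_tail(t: int, deviation: float) -> float:
    """Bound exp(-2 * deviation^2 / t) for a sum of t martingale
    differences whose ranges have length at most 1."""
    return math.exp(-2.0 * deviation ** 2 / t)


def lemma15_bound(n: int) -> float:
    """Union bound over t = 1, ..., n of the tail with deviation sqrt(2 t ln n)."""
    return sum(
        hoeffding_azuma_tail(t, math.sqrt(2.0 * t * math.log(n)))
        for t in range(1, n + 1)
    )


if __name__ == "__main__":
    n = 50
    # Each per-t term should match n^-4; the sum should match n^-3.
    print(hoeffding_azuma_tail(7, math.sqrt(2 * 7 * math.log(n))), n ** -4)
    print(lemma15_bound(n), n ** -3)
```

The per-$t$ bound does not depend on $t$ because the deviation is scaled as $\sqrt{t}$, which is why the union bound costs exactly one factor of $n$.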



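Lemma 16 For all integers $t\le n$, for all suboptimal nodes $(h,i)$ such that $\Delta_{h,i}>\nu_1\rho^h$, and for all integers $u\ge 1$ such that

$$u\ge\frac{8\ln n}{(\Delta_{h,i}-\nu_1\rho^h)^2},$$

one has

$$\mathbb{P}\{U_{h,i}(t)>f^* \text{ and } T_{h,i}(t)>u\}\le t\,n^{-4}.$$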
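Proof The $u$ mentioned in the statement of the lemma are such that

$$\frac{\Delta_{h,i}-\nu_1\rho^h}{2}\ge\sqrt{\frac{2\ln n}{u}},\qquad\text{thus}\qquad\sqrt{\frac{2\ln t}{u}}+\nu_1\rho^h\le\frac{\Delta_{h,i}+\nu_1\rho^h}{2}.$$

Therefore,

$$\begin{aligned}
&\mathbb{P}\{U_{h,i}(t)>f^* \text{ and } T_{h,i}(t)>u\} \\
&\quad= \mathbb{P}\Big\{\widehat{\mu}_{h,i}(t)+\sqrt{\tfrac{2\ln t}{T_{h,i}(t)}}+\nu_1\rho^h>f^*_{h,i}+\Delta_{h,i} \text{ and } T_{h,i}(t)>u\Big\} \\
&\quad\le \mathbb{P}\Big\{\widehat{\mu}_{h,i}(t)>f^*_{h,i}+\frac{\Delta_{h,i}-\nu_1\rho^h}{2} \text{ and } T_{h,i}(t)>u\Big\} \\
&\quad\le \mathbb{P}\Big\{T_{h,i}(t)\big(\widehat{\mu}_{h,i}(t)-f^*_{h,i}\big)>\frac{\Delta_{h,i}-\nu_1\rho^h}{2}\,T_{h,i}(t) \text{ and } T_{h,i}(t)>u\Big\} \\
&\quad= \mathbb{P}\Big\{\sum_{s=1}^{t}\big(Y_s-f^*_{h,i}\big)\mathbb{I}_{\{(H_s,I_s)\in\mathcal{C}(h,i)\}}>\frac{\Delta_{h,i}-\nu_1\rho^h}{2}\,T_{h,i}(t) \text{ and } T_{h,i}(t)>u\Big\} \\
&\quad\le \mathbb{P}\Big\{\sum_{s=1}^{t}\big(Y_s-f(X_s)\big)\mathbb{I}_{\{(H_s,I_s)\in\mathcal{C}(h,i)\}}>\frac{\Delta_{h,i}-\nu_1\rho^h}{2}\,T_{h,i}(t) \text{ and } T_{h,i}(t)>u\Big\}.
\end{aligned}$$

Now it follows from the same arguments as in the proof of Lemma 15 (optional skipping, the Hoeffding-Azuma inequality, and a union bound) that

$$\begin{aligned}
&\mathbb{P}\Big\{\sum_{s=1}^{t}\big(Y_s-f(X_s)\big)\mathbb{I}_{\{(H_s,I_s)\in\mathcal{C}(h,i)\}}>\frac{\Delta_{h,i}-\nu_1\rho^h}{2}\,T_{h,i}(t) \text{ and } T_{h,i}(t)>u\Big\} \\
&\quad\le \sum_{s'=u+1}^{t}\mathbb{P}\Big\{\sum_{j=1}^{s'}\big(\widetilde{Y}_j-f(\widetilde{X}_j)\big)>\frac{\Delta_{h,i}-\nu_1\rho^h}{2}\,s'\Big\}
\le \sum_{s'=u+1}^{t}\exp\Big(-\frac{1}{2}\,s'\,(\Delta_{h,i}-\nu_1\rho^h)^2\Big) \\
&\quad\le t\,\exp\Big(-\frac{1}{2}\,u\,(\Delta_{h,i}-\nu_1\rho^h)^2\Big)\le t\,n^{-4},
\end{aligned}$$

where we used the stated bound on $u$ to obtain the last inequality.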



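Combining the results of Lemmas 14, 15 and 16 leads to the following key result bounding the expected number of visits to descendants of a "poor" node.

Lemma 17 Under Assumptions A1 and A2, for all suboptimal nodes $(h,i)$ with $\Delta_{h,i}>\nu_1\rho^h$, we have, for all $n\ge 1$,

$$\mathbb{E}[T_{h,i}(n)]\le\frac{8\ln n}{(\Delta_{h,i}-\nu_1\rho^h)^2}+4.$$

Proof We take $u$ as the upper integer part of $(8\ln n)/(\Delta_{h,i}-\nu_1\rho^h)^2$ and use union bounds to get from Lemma 14 the bound

$$\mathbb{E}[T_{h,i}(n)]\le\frac{8\ln n}{(\Delta_{h,i}-\nu_1\rho^h)^2}+1+\sum_{t=u+1}^{n}\Big(\mathbb{P}\{T_{h,i}(t)>u \text{ and } U_{h,i}(t)>f^*\}+\sum_{s=1}^{t-1}\mathbb{P}\{U_{s,i^*_s}(t)\le f^*\}\Big).$$

Lemmas 15 and 16 further bound the quantity of interest as

$$\mathbb{E}[T_{h,i}(n)]\le\frac{8\ln n}{(\Delta_{h,i}-\nu_1\rho^h)^2}+1+\sum_{t=u+1}^{n}\Big(t\,n^{-4}+\sum_{s=1}^{t-1}t^{-3}\Big),$$

and we now use the crude upper bounds

$$1+\sum_{t=u+1}^{n}\Big(t\,n^{-4}+\sum_{s=1}^{t-1}t^{-3}\Big)\le 1+\sum_{t=1}^{n}\big(n^{-3}+t^{-2}\big)\le 2+\frac{\pi^2}{6}\le 4$$

to get the proposed statement. ■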
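The crude bounds used at the end of the proof can be verified numerically. The sketch below is illustrative only; the function names and the sample parameter values are assumptions, not taken from the paper. It evaluates the remainder $1+\sum_{t=1}^{n}(n^{-3}+t^{-2})$ and the resulting visit bound of Lemma 17.

```python
import math


def crude_remainder(n: int) -> float:
    """Evaluate 1 + sum_{t=1}^n (n^-3 + t^-2), which the proof bounds by 4."""
    return 1.0 + sum(n ** -3 + t ** -2 for t in range(1, n + 1))


def visit_bound(gap: float, nu1: float, rho: float, h: int, n: int) -> float:
    """Lemma 17 bound 8 ln n / (gap - nu1 * rho^h)^2 + 4 on the expected
    number of visits; only meaningful when gap > nu1 * rho^h."""
    margin = gap - nu1 * rho ** h
    if margin <= 0:
        raise ValueError("the bound requires gap > nu1 * rho**h")
    return 8.0 * math.log(n) / margin ** 2 + 4.0


if __name__ == "__main__":
    for n in (1, 10, 1000):
        print(n, crude_remainder(n))          # always below 4
    print(visit_bound(gap=0.5, nu1=1.0, rho=0.5, h=2, n=100))
```

The remainder stays below $2+\pi^2/6\approx 3.64$ for every $n$, so the constant 4 in the lemma leaves a little slack.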

Proof (of Theorem 6) First, let us fix $d'>d$. The statement will be proven in four steps.

First step. For all $h=0,1,2,\dots$, denote by $\mathcal{I}_h$ the set of those nodes at depth $h$ that are $2\nu_1\rho^h$-optimal, i.e., the nodes $(h,i)$ such that $f^*_{h,i}\ge f^*-2\nu_1\rho^h$. (Of course, $\mathcal{I}_0=\{(0,1)\}$.) Then, let $\mathcal{I}$ be the union of these sets when $h$ varies. Further, let $\mathcal{J}$ be the set of nodes that are not in $\mathcal{I}$ but whose parent is in $\mathcal{I}$. Finally, for $h=1,2,\dots$ we denote by $\mathcal{J}_h$ the nodes in $\mathcal{J}$ that are located at depth $h$ in the tree (i.e., whose parent is in $\mathcal{I}_{h-1}$).

Lemma 17 bounds in particular the expected number of times each node $(h,i)\in\mathcal{J}_h$ is visited. Since for these nodes $\Delta_{h,i}>2\nu_1\rho^h$, we get

$$\mathbb{E}[T_{h,i}(n)]\le\frac{8\ln n}{\nu_1^2\rho^{2h}}+4.$$

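Second step. We bound the cardinality $|\mathcal{I}_h|$ of $\mathcal{I}_h$. We start with the case $h\ge 1$. By definition, when $(h,i)\in\mathcal{I}_h$, one has $\Delta_{h,i}\le 2\nu_1\rho^h$, so that by Lemma 3 the inclusion $\mathcal{P}_{h,i}\subset\mathcal{X}_{4\nu_1\rho^h}$ holds. Since by Assumption A1 the sets $\mathcal{P}_{h,i}$ contain disjoint balls of radius $\nu_2\rho^h$, we have that

$$|\mathcal{I}_h|\le\mathcal{N}\Big(\bigcup_{(h,i)\in\mathcal{I}_h}\mathcal{P}_{h,i},\ \ell,\ \nu_2\rho^h\Big)\le\mathcal{N}\big(\mathcal{X}_{4\nu_1\rho^h},\ \ell,\ \nu_2\rho^h\big)=\mathcal{N}\big(\mathcal{X}_{(4\nu_1/\nu_2)\,\nu_2\rho^h},\ \ell,\ \nu_2\rho^h\big).$$

We prove below that there exists a constant $C$ such that for all $\varepsilon\le\nu_2$,

$$\mathcal{N}\big(\mathcal{X}_{(4\nu_1/\nu_2)\,\varepsilon},\ \ell,\ \varepsilon\big)\le C\,\varepsilon^{-d'}. \qquad (9)$$

Thus we obtain the bound $|\mathcal{I}_h|\le C\,(\nu_2\rho^h)^{-d'}$ for all $h\ge 1$. We note that the obtained bound $|\mathcal{I}_h|\le C\,(\nu_2\rho^h)^{-d'}$ is still valid for $h=0$, since $|\mathcal{I}_0|=1$.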

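It only remains to prove (9). Since $d'>d$, where $d$ is the near-optimality dimension of $f$, we have, by definition, that

$$\limsup_{\varepsilon\to 0}\frac{\ln\mathcal{N}\big(\mathcal{X}_{(4\nu_1/\nu_2)\,\varepsilon},\ \ell,\ \varepsilon\big)}{\ln(\varepsilon^{-1})}\le d,$$

and thus, there exists $\varepsilon_{d'}>0$ such that for all $\varepsilon\le\varepsilon_{d'}$,

$$\frac{\ln\mathcal{N}\big(\mathcal{X}_{(4\nu_1/\nu_2)\,\varepsilon},\ \ell,\ \varepsilon\big)}{\ln(\varepsilon^{-1})}\le d',$$

which in turn implies that for all $\varepsilon\le\varepsilon_{d'}$,

$$\mathcal{N}\big(\mathcal{X}_{(4\nu_1/\nu_2)\,\varepsilon},\ \ell,\ \varepsilon\big)\le\varepsilon^{-d'}.$$

The result is proved with $C=1$ if $\varepsilon_{d'}\ge\nu_2$. Now, consider the case $\varepsilon_{d'}<\nu_2$. Given the definition of packing numbers, it is straightforward that for all $\varepsilon\in[\varepsilon_{d'},\nu_2]$,

$$\mathcal{N}\big(\mathcal{X}_{(4\nu_1/\nu_2)\,\varepsilon},\ \ell,\ \varepsilon\big)\le u_{d'}\stackrel{\text{def}}{=}\mathcal{N}\big(\mathcal{X},\ \ell,\ \varepsilon_{d'}\big);$$

therefore, for all $\varepsilon\in[\varepsilon_{d'},\nu_2]$,

$$\mathcal{N}\big(\mathcal{X}_{(4\nu_1/\nu_2)\,\varepsilon},\ \ell,\ \varepsilon\big)\le u_{d'}\,\frac{\nu_2^{d'}}{\varepsilon^{d'}}=C\,\varepsilon^{-d'}$$

for the choice $C=\max\{1,\ u_{d'}\,\nu_2^{d'}\}$. Because we take the maximum with 1, the stated inequality also holds for $\varepsilon\le\varepsilon_{d'}$, which concludes the proof of (9).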

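Third step. Let $H\ge 1$ be an integer to be chosen later. We partition the nodes of the infinite tree $\mathcal{T}$ into three subsets, $\mathcal{T}=\mathcal{T}^1\cup\mathcal{T}^2\cup\mathcal{T}^3$, as follows. Let the set $\mathcal{T}^1$ contain the descendants of the nodes in $\mathcal{I}_H$ (by convention, a node is considered its own descendant, hence the nodes of $\mathcal{I}_H$ are included in $\mathcal{T}^1$); let $\mathcal{T}^2=\bigcup_{0\le h<H}\mathcal{I}_h$; and let $\mathcal{T}^3$ contain the descendants of the nodes in $\bigcup_{1\le h\le H}\mathcal{J}_h$. Thus, $\mathcal{T}^1$ and $\mathcal{T}^3$ are potentially infinite, while $\mathcal{T}^2$ is finite.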

We recall that we denote by $(H_t,I_t)$ the node that was chosen by HOO in round $t$. From the definition of the algorithm, each node is played at most once, thus no two such random variables are equal when $t$ varies. We decompose the regret according to which of the sets $\mathcal{T}^j$ the nodes $(H_t,I_t)$ belong to:

$$\mathbb{E}[R_n]=\mathbb{E}\Big[\sum_{t=1}^{n}\big(f^*-f(X_t)\big)\Big]=\mathbb{E}[R_{n,1}]+\mathbb{E}[R_{n,2}]+\mathbb{E}[R_{n,3}],\qquad\text{where } R_{n,i}=\sum_{t=1}^{n}\big(f^*-f(X_t)\big)\,\mathbb{I}_{\{(H_t,I_t)\in\mathcal{T}^i\}},\ \text{for } i=1,2,3.$$

The contribution from $\mathcal{T}^1$ is easy to bound. By definition any node in $\mathcal{I}_H$ is $2\nu_1\rho^H$-optimal. Hence, by Lemma 3 the corresponding domain is included in $\mathcal{X}_{4\nu_1\rho^H}$. By definition of a tree of coverings, the domains of the descendants of these nodes are still included in $\mathcal{X}_{4\nu_1\rho^H}$. Therefore,

$$\mathbb{E}[R_{n,1}]\le 4\nu_1\rho^H n.$$

For $h\ge 0$, consider a node $(h,i)\in\mathcal{T}^2$. It belongs to $\mathcal{I}_h$ and is therefore $2\nu_1\rho^h$-optimal. By Lemma 3, the corresponding domain is included in $\mathcal{X}_{4\nu_1\rho^h}$. By the result of the second step of this proof and using that each node is played at most once, one gets

$$\mathbb{E}[R_{n,2}]\le\sum_{h=0}^{H-1}4\nu_1\rho^h\,|\mathcal{I}_h|\le 4C\nu_1\nu_2^{-d'}\sum_{h=0}^{H-1}\rho^{h(1-d')}.$$

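We finish by bounding the contribution from $\mathcal{T}^3$. We first remark that since the parent of any element $(h,i)\in\mathcal{J}_h$ is in $\mathcal{I}_{h-1}$, by Lemma 3 again, we have that $\mathcal{P}_{h,i}\subset\mathcal{X}_{4\nu_1\rho^{h-1}}$. We now use the first step of this proof to get

$$\mathbb{E}[R_{n,3}]\le\sum_{h=1}^{H}4\nu_1\rho^{h-1}\sum_{i:(h,i)\in\mathcal{J}_h}\mathbb{E}[T_{h,i}(n)]\le\sum_{h=1}^{H}4\nu_1\rho^{h-1}\,|\mathcal{J}_h|\Big(\frac{8\ln n}{\nu_1^2\rho^{2h}}+4\Big).$$

Now, it follows from the fact that the parent of each node of $\mathcal{J}_h$ is in $\mathcal{I}_{h-1}$ that $|\mathcal{J}_h|\le 2|\mathcal{I}_{h-1}|$ when $h\ge 1$. Substituting this and the bound on $|\mathcal{I}_{h-1}|$ obtained in the second step of this proof, we get

$$\begin{aligned}
\mathbb{E}[R_{n,3}]&\le\sum_{h=1}^{H}4\nu_1\rho^{h-1}\,2C\,(\nu_2\rho^{h-1})^{-d'}\Big(\frac{8\ln n}{\nu_1^2\rho^{2h}}+4\Big) \\
&\le 8C\nu_1\nu_2^{-d'}\sum_{h=1}^{H}\rho^{h(1-d')+d'-1}\Big(\frac{8\ln n}{\nu_1^2\rho^{2h}}+4\Big).
\end{aligned}$$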

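Fourth step. Putting the obtained bounds together, we get

$$\begin{aligned}
\mathbb{E}[R_n]&\le 4\nu_1\rho^H n+4C\nu_1\nu_2^{-d'}\sum_{h=0}^{H-1}\rho^{h(1-d')}+8C\nu_1\nu_2^{-d'}\sum_{h=1}^{H}\rho^{h(1-d')+d'-1}\Big(\frac{8\ln n}{\nu_1^2\rho^{2h}}+4\Big) \\
&=O\Big(n\rho^H+(\ln n)\sum_{h=1}^{H}\rho^{-h(1+d')}\Big)=O\big(n\rho^H+\rho^{-H(1+d')}\ln n\big) \qquad (10)
\end{aligned}$$

(recall that $\rho<1$). Note that all constants hidden in the $O$ symbol only depend on $\nu_1$, $\nu_2$, $\rho$ and $d'$.

Now, by choosing $H$ such that $\rho^{-H(d'+2)}$ is of the order of $n/\ln n$, that is, $\rho^H$ is of the order of $(n/\ln n)^{-1/(d'+2)}$, we get the desired result, namely,

$$\mathbb{E}[R_n]=O\big(n^{(d'+1)/(d'+2)}\,(\ln n)^{1/(d'+2)}\big).$$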
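To make the balancing argument of the fourth step concrete, here is a small illustrative sketch (the function names and the sample values of $n$, $\rho$ and $d'$ are assumptions). It picks the smallest depth $H$ with $\rho^{-H(d'+2)}\ge n/\ln n$ and compares the two terms of (10) with the rate $n^{(d'+1)/(d'+2)}(\ln n)^{1/(d'+2)}$, ignoring the hidden constants.

```python
import math


def balanced_depth(n: int, rho: float, d_prime: float) -> int:
    """Smallest integer H >= 1 such that rho^(-H (d'+2)) >= n / ln n (needs n >= 2)."""
    target = n / math.log(n)
    H = 1
    while rho ** (-H * (d_prime + 2)) < target:
        H += 1
    return H


def regret_terms(n: int, rho: float, d_prime: float, H: int):
    """The two terms of (10), without constants: n rho^H and rho^(-H (1+d')) ln n."""
    return n * rho ** H, rho ** (-H * (1 + d_prime)) * math.log(n)


def rate(n: int, d_prime: float) -> float:
    """Target rate n^((d'+1)/(d'+2)) * (ln n)^(1/(d'+2))."""
    return n ** ((d_prime + 1) / (d_prime + 2)) * math.log(n) ** (1 / (d_prime + 2))


if __name__ == "__main__":
    n, rho, d_prime = 10 ** 6, 0.5, 1.0
    H = balanced_depth(n, rho, d_prime)
    print(H, regret_terms(n, rho, d_prime, H), rate(n, d_prime))
```

With this choice of $H$ the first term is at most the rate, and the second term exceeds it by at most a factor $\rho^{-(1+d')}$, which is a constant independent of $n$.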



A.2 Proof of Theorem 8 (regret bound for truncated HOO)

The proof follows from an adaptation of the proof of Theorem 6 and of its associated lemmas; for the sake of clarity and precision, we explicitly state the adaptations of the latter.

Adaptations of the lemmas. Remember that $D_{n_0}$ denotes the maximum depth of the tree, given horizon $n_0$. The adaptation of Lemma 14 is done as follows. Let $(h,i)$ be a suboptimal node with $h\le D_{n_0}$ and let $0\le k\le h-1$ be the largest depth such that $(k,i^*_k)$ is on the path from the root $(0,1)$ to $(h,i)$. Then, for all integers $u\ge 0$, one has

$$\mathbb{E}[T_{h,i}(n_0)]\le u+\sum_{t=u+1}^{n_0}\mathbb{P}\Big\{\big[U_{s,i^*_s}(t)\le f^* \text{ for some } s \text{ with } k+1\le s\le\min\{D_{n_0},n_0\}\big] \text{ or } \big[T_{h,i}(t)>u \text{ and } U_{h,i}(t)>f^*\big]\Big\}.$$

As for Lemma 15, its straightforward adaptation states that under Assumptions A1 and A2, for all optimal nodes $(h,i)$ with $h\le D_{n_0}$ and for all integers $1\le t\le n_0$,

$$\mathbb{P}\{U_{h,i}(t)\le f^*\}\le t\,(n_0)^{-4}\le(n_0)^{-3}.$$

Similarly, the same changes yield from Lemma 16 the following result for truncated HOO. For all integers $t\le n_0$, for all suboptimal nodes $(h,i)$ such that $h\le D_{n_0}$ and $\Delta_{h,i}>\nu_1\rho^h$, and for all integers $u\ge 1$ such that

$$u\ge\frac{8\ln n_0}{(\Delta_{h,i}-\nu_1\rho^h)^2},$$

one has

$$\mathbb{P}\{U_{h,i}(t)>f^* \text{ and } T_{h,i}(t)>u\}\le t\,(n_0)^{-4}.$$

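Combining these three results (using the same methodology as in the proof of Lemma 17) shows that under Assumptions A1 and A2, for all suboptimal nodes $(h,i)$ such that $h\le D_{n_0}$ and $\Delta_{h,i}>\nu_1\rho^h$, one has (with $u$ chosen as in the proof of Lemma 17)

$$\begin{aligned}
\mathbb{E}[T_{h,i}(n_0)]&\le\frac{8\ln n_0}{(\Delta_{h,i}-\nu_1\rho^h)^2}+1+\sum_{t=u+1}^{n_0}\Big(t\,(n_0)^{-4}+\sum_{s=1}^{\min\{D_{n_0},n_0\}}(n_0)^{-3}\Big) \\
&\le\frac{8\ln n_0}{(\Delta_{h,i}-\nu_1\rho^h)^2}+3.
\end{aligned}$$

(We thus even improve slightly the bound of Lemma 17.)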



Adaptation of the proof of Theorem 6. The main change here comes from the fact that trees are cut at the depth $D_{n_0}$. As a consequence, the sets $\mathcal{I}_h$, $\mathcal{J}$ and $\mathcal{J}_h$ are defined only by referring to nodes of depth smaller than $D_{n_0}$. All steps of the proof can then be repeated, except the third step; there, while the bounds on the regret resulting from nodes of $\mathcal{T}^1$ and $\mathcal{T}^3$ go through without any changes (as these sets were constructed by considering all descendants of some base nodes), the bound on the regret $R_{n,2}$ associated with the nodes of $\mathcal{T}^2$ calls for a modified proof, since at this stage we used the property that each node is played at most once. But this is not true anymore for nodes $(h,i)$ located at depth $D_{n_0}$, which can be played several times. Therefore the proof is modified as follows.

Consider a node at depth $h=D_{n_0}$. Then, by definition of $D_{n_0}$,

$$h=D_{n_0}\ge\frac{(\ln n_0)/2-\ln(1/\nu_1)}{\ln(1/\rho)},\qquad\text{that is,}\qquad\nu_1\rho^h\le\frac{1}{\sqrt{n_0}}.$$

Since the considered nodes are $2\nu_1\rho^{D_{n_0}}$-optimal, the corresponding domains are $4\nu_1\rho^{D_{n_0}}$-optimal by Lemma 3, thus also $4/\sqrt{n_0}$-optimal. The instantaneous regret incurred when playing any of these nodes is therefore bounded by $4/\sqrt{n_0}$, and the associated cumulative regret (over $n_0$ rounds) can be bounded by $4\sqrt{n_0}$. In conclusion, with the notations of Theorem 6, we get the new bound

$$\mathbb{E}[R_{n,2}]\le\sum_{h=0}^{H-1}4\nu_1\rho^h\,|\mathcal{I}_h|+4\sqrt{n_0}\le 4\sqrt{n_0}+4C\nu_1\nu_2^{-d'}\sum_{h=0}^{H-1}\rho^{h(1-d')}.$$

The rest of the proof goes through and only this additional additive term of $4\sqrt{n_0}$ is suffered in the final regret bound. (The additional term can be included in the $O$ notation.)
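The role of the truncation depth can be illustrated with a short sketch. It assumes that $D_{n_0}$ is the smallest non-negative integer satisfying the inequality above (the ceiling is an assumption here, since only the inequality is used in the argument); the function names are hypothetical.

```python
import math


def truncation_depth(n0: int, nu1: float, rho: float) -> int:
    """Smallest non-negative integer D with
    D >= ((ln n0)/2 - ln(1/nu1)) / ln(1/rho)  (assumed definition)."""
    raw = (0.5 * math.log(n0) - math.log(1.0 / nu1)) / math.log(1.0 / rho)
    return max(0, math.ceil(raw))


def deep_node_regret(n0: int) -> float:
    """Cumulative regret bound n0 * (4 / sqrt(n0)) = 4 sqrt(n0) for nodes at depth D."""
    return n0 * 4.0 / math.sqrt(n0)


if __name__ == "__main__":
    n0, nu1, rho = 10_000, 1.0, 0.5
    D = truncation_depth(n0, nu1, rho)
    # At depth D the diameter bound nu1 * rho^D is at most 1 / sqrt(n0).
    print(D, nu1 * rho ** D, 1 / math.sqrt(n0), deep_node_regret(n0))
```

The depth grows only logarithmically with the horizon, which is what keeps the extra $4\sqrt{n_0}$ term harmless in the final bound.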
A.3 Proof of Theorem 9 (regret bound for z-HOO)

We start with the following equivalent of Lemma 3 in this new local context. Remember that $h_0$ is the smallest integer such that

$$2\nu_1\rho^{h_0}<\varepsilon_0.$$

Lemma 18 Under Assumptions A1 and A2', for all $h\ge h_0$, if the suboptimality factor $\Delta_{h,i}$ of a region $\mathcal{P}_{h,i}$ is bounded by $c\,\nu_1\rho^h$ for some $c\in[0,2]$, then all arms in $\mathcal{P}_{h,i}$ are $L\max\{2c,\ c+1\}\,\nu_1\rho^h$-optimal, that is,

$$\mathcal{P}_{h,i}\subset\mathcal{X}_{L\max\{2c,\ c+1\}\,\nu_1\rho^h}.$$

When $c=0$, i.e., the node $(h,i)$ is optimal, the bound improves to

$$\mathcal{P}_{h,i}\subset\mathcal{X}_{\nu_1\rho^h}.$$
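Proof We first deal with the general case of $c\in[0,2]$. By the hypothesis on the suboptimality of $\mathcal{P}_{h,i}$, for all $\delta>0$, there exists an element $x\in\mathcal{X}_{c\nu_1\rho^h+\delta}\cap\mathcal{P}_{h,i}$. If $\delta$ is small enough, e.g., $\delta\in(0,\ \varepsilon_0-2\nu_1\rho^h]$, then this element satisfies $x\in\mathcal{X}_{\varepsilon_0}$. Let $y\in\mathcal{P}_{h,i}$. By Assumption A1, $\ell(x,y)\le\operatorname{diam}(\mathcal{P}_{h,i})\le\nu_1\rho^h$, which entails, by denoting $\varepsilon=\max\{0,\ \nu_1\rho^h-(f^*-f(x))\}$,

$$\ell(x,y)\le\nu_1\rho^h\le f^*-f(x)+\varepsilon,\qquad\text{that is,}\qquad y\in\mathcal{B}\big(x,\ f^*-f(x)+\varepsilon\big).$$

Since $x\in\mathcal{X}_{\varepsilon_0}$ and $\varepsilon\le\nu_1\rho^h\le\nu_1\rho^{h_0}<\varepsilon_0$, the second part of Assumption A2' then yields

$$y\in\mathcal{B}\big(x,\ f^*-f(x)+\varepsilon\big)\subset\mathcal{X}_{L(2(f^*-f(x))+\varepsilon)}.$$

It follows from the definition of $\varepsilon$ that $f^*-f(x)+\varepsilon=\max\{f^*-f(x),\ \nu_1\rho^h\}$, and this implies

$$y\in\mathcal{B}\big(x,\ f^*-f(x)+\varepsilon\big)\subset\mathcal{X}_{L\left(f^*-f(x)+\max\{f^*-f(x),\ \nu_1\rho^h\}\right)}.$$

But $x\in\mathcal{X}_{c\nu_1\rho^h+\delta}$, i.e., $f^*-f(x)\le c\,\nu_1\rho^h+\delta$; we thus have proved

$$y\in\mathcal{X}_{L\left(\max\{2c,\ c+1\}\,\nu_1\rho^h+2\delta\right)}.$$

In conclusion, $\mathcal{P}_{h,i}\subset\mathcal{X}_{L\max\{2c,\ c+1\}\,\nu_1\rho^h+2L\delta}$ for all sufficiently small $\delta>0$. Letting $\delta\to 0$ concludes the proof.

In the case of $c=0$, we resort to the first part of Assumption A2', which can be applied since $\operatorname{diam}(\mathcal{P}_{h,i})\le\nu_1\rho^h\le\varepsilon_0$ as already noted above, and can exactly be restated as indicating that for all $y\in\mathcal{P}_{h,i}$,

$$f^*-f(y)\le\operatorname{diam}(\mathcal{P}_{h,i})\le\nu_1\rho^h;$$

that is, $\mathcal{P}_{h,i}\subset\mathcal{X}_{\nu_1\rho^h}$. ■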

We now provide an adaptation of Lemma 17 (actually based on adaptations of Lemmas 14 and 15), providing the same bound under local conditions that relax the assumptions of Lemma 17 to some extent.

Lemma 19 Consider a depth $z\ge h_0$. Under Assumptions A1 and A2', the algorithm z-HOO satisfies that for all $n\ge 1$ and all suboptimal nodes $(h,i)$ with $\Delta_{h,i}>\nu_1\rho^h$ and $h\ge z$,

$$\mathbb{E}[T_{h,i}(n)]\le\frac{8\ln n}{(\Delta_{h,i}-\nu_1\rho^h)^2}+4.$$
Proof We consider some path $(z,i^*_z),\ (z+1,i^*_{z+1}),\ \dots$ of optimal nodes, starting at depth $z$. We distinguish two cases, depending on whether there exists $z\le k'\le h-1$ such that $(h,i)\in\mathcal{C}(k',i^*_{k'})$ or not.

In the first case, we denote by $k$ the largest such $k'$. The argument of Lemma 14 can be used without any change and shows that for all integers $u\ge 0$,

$$\mathbb{E}[T_{h,i}(n)]\le u+\sum_{t=u+1}^{n}\mathbb{P}\Big\{\big[U_{s,i^*_s}(t)\le f^* \text{ for some } s\in\{k+1,\dots,t-1\}\big] \text{ or } \big[T_{h,i}(t)>u \text{ and } U_{h,i}(t)>f^*\big]\Big\}.$$

In the second case, we denote by $(z,i_h)$ the ancestor of $(h,i)$ located at depth $z$. By definition of z-HOO, $(H_t,I_t)\in\mathcal{C}(h,i)$ at some round $t\ge 1$ only if $B_{z,i^*_z}(t)\le B_{z,i_h}(t)$, and since $B$-values can only increase on a chosen path, $(H_t,I_t)\in\mathcal{C}(h,i)$ can only happen if $B_{z,i^*_z}(t)\le B_{h,i}(t)$. Repeating again the argument of Lemma 14, we get that for all integers $u\ge 0$,

$$\mathbb{E}[T_{h,i}(n)]\le u+\sum_{t=u+1}^{n}\mathbb{P}\Big\{\big[U_{s,i^*_s}(t)\le f^* \text{ for some } s\in\{z,\dots,t-1\}\big] \text{ or } \big[T_{h,i}(t)>u \text{ and } U_{h,i}(t)>f^*\big]\Big\}.$$

Now, notice that Lemma 16 is valid without any assumption. On the other hand, with the modified assumptions, Lemma 15 is still true but only for optimal nodes $(h,i)$ with $h\ge h_0$. Indeed, the only point in its proof where the assumptions were used was in the fourth line, when applying Lemma 3; here, Lemma 18 with $c=0$ provides the needed guarantee.

The proof is concluded with the same computations as in the proof of Lemma 17. ■

Proof (of Theorem 9) We follow the four steps in the proof of Theorem 6 with some slight adjustments. In particular, for $h\ge z$, we use the sets of nodes $\mathcal{I}_h$ and $\mathcal{J}_h$ defined therein.

First step. Lemma 19 bounds the expected number of times each node $(h,i)\in\mathcal{J}_h$ is visited. Since for these nodes $\Delta_{h,i}>2\nu_1\rho^h$, we get

$$\mathbb{E}[T_{h,i}(n)]\le\frac{8\ln n}{\nu_1^2\rho^{2h}}+4.$$

Second step. We bound here the cardinality $|\mathcal{I}_h|$. By Lemma 18 with $c=2$, when $(h,i)\in\mathcal{I}_h$ and $h\ge z$, one has $\mathcal{P}_{h,i}\subset\mathcal{X}_{4L\nu_1\rho^h}$.

Now, by Assumption A1 and by using the same argument as in the second step of the proof of Theorem 6,

$$|\mathcal{I}_h|\le\mathcal{N}\big(\mathcal{X}_{(4L\nu_1/\nu_2)\,\nu_2\rho^h},\ \ell,\ \nu_2\rho^h\big).$$

Assumption A3 can be applied since $\nu_2\rho^h\le 2\nu_1\rho^h\le 2\nu_1\rho^{h_0}\le\varepsilon_0$ and yields the inequality $|\mathcal{I}_h|\le C\,(\nu_2\rho^h)^{-d}$.

Third step. We consider some integer $H\ge z$ to be defined by the analysis in the fourth step. We define a partition of the nodes located at a depth equal to or larger than $z$; more precisely,

• $\mathcal{T}^1$ contains the nodes of $\mathcal{I}_H$ and their descendants,

• $\mathcal{T}^2=\bigcup_{z\le h\le H-1}\mathcal{I}_h$,

• $\mathcal{T}^3$ contains the nodes of $\bigcup_{z+1\le h\le H}\mathcal{J}_h$ and their descendants,

• $\mathcal{T}^4$ is formed by the nodes located at depth $z$ not belonging to $\mathcal{I}_z$, i.e., such that $\Delta_{z,i}>2\nu_1\rho^z$, and their descendants.

As in the proof of Theorem 6, we denote by $R_{n,i}$ the regret resulting from the selection of nodes in $\mathcal{T}^i$, for $i\in\{1,2,3,4\}$.

Lemma 18 with $c=2$ yields the bound $\mathbb{E}[R_{n,1}]\le 4L\nu_1\rho^H n$, where we crudely bounded by $n$ the number of times that nodes in $\mathcal{T}^1$ were played. Using that by definition each node of $\mathcal{T}^2$ can be played only once, we get

$$\mathbb{E}[R_{n,2}]\le\sum_{h=z}^{H-1}4L\nu_1\rho^h\,|\mathcal{I}_h|\le 4CL\nu_1\nu_2^{-d}\sum_{h=z}^{H-1}\rho^{h(1-d)}.$$

As for $R_{n,3}$, we also use here that nodes in $\mathcal{T}^3$ belong to some $\mathcal{J}_h$, with $z+1\le h\le H$; in particular, they are the child of some element of $\mathcal{I}_{h-1}$ and as such, firstly, they are $4L\nu_1\rho^{h-1}$-optimal (by Lemma 18) and secondly, their number is bounded by $|\mathcal{J}_h|\le 2|\mathcal{I}_{h-1}|\le 2C\,(\nu_2\rho^{h-1})^{-d}$. Thus,

$$\mathbb{E}[R_{n,3}]\le\sum_{h=z+1}^{H}4L\nu_1\rho^{h-1}\sum_{i:(h,i)\in\mathcal{J}_h}\mathbb{E}[T_{h,i}(n)]\le\sum_{h=z+1}^{H}4L\nu_1\rho^{h-1}\,2C\,(\nu_2\rho^{h-1})^{-d}\Big(\frac{8\ln n}{\nu_1^2\rho^{2h}}+4\Big),$$

where we used the bound of Lemma 19. Finally, for $\mathcal{T}^4$, we use that it contains at most $2^z-1$ nodes at depth $z$, each of them being associated with a regret controlled by Lemma 19; therefore,

$$\mathbb{E}[R_{n,4}]\le(2^z-1)\Big(\frac{8\ln n}{\nu_1^2\rho^{2z}}+4\Big).$$

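Fourth step. Putting things together, we have proved that

$$\mathbb{E}[R_n]\le 4L\nu_1\rho^H n+\mathbb{E}[R_{n,2}]+\mathbb{E}[R_{n,3}]+(2^z-1)\Big(\frac{8\ln n}{\nu_1^2\rho^{2z}}+4\Big),$$

where (using that $\rho<1$ in the second inequality)

$$\begin{aligned}
\mathbb{E}[R_{n,2}]+\mathbb{E}[R_{n,3}]
&\le\sum_{h=z}^{H-1}4L\nu_1\rho^h\,C\,(\nu_2\rho^h)^{-d}+\sum_{h=z+1}^{H}4L\nu_1\rho^{h-1}\,2C\,(\nu_2\rho^{h-1})^{-d}\Big(\frac{8\ln n}{\nu_1^2\rho^{2h}}+4\Big) \\
&=4CL\nu_1\nu_2^{-d}\sum_{h=z}^{H-1}\rho^{h(1-d)}+8CL\nu_1\nu_2^{-d}\sum_{h=z}^{H-1}\rho^{h(1-d)}\Big(\frac{8\ln n}{\nu_1^2\rho^{2(h+1)}}+4\Big) \\
&\le 4CL\nu_1\nu_2^{-d}\sum_{h=z}^{H-1}\rho^{h(1-d)}\frac{1}{\rho^{2h}}+8CL\nu_1\nu_2^{-d}\sum_{h=z}^{H-1}\rho^{h(1-d)}\Big(\frac{8\ln n}{\nu_1^2\rho^2\rho^{2h}}+\frac{4}{\rho^{2h}}\Big) \\
&=CL\nu_1\nu_2^{-d}\Big(\sum_{h=z}^{H-1}\rho^{-h(1+d)}\Big)\Big(36+\frac{64\ln n}{\nu_1^2\rho^2}\Big).
\end{aligned}$$

Denoting

$$\gamma=\frac{4CL\nu_1\nu_2^{-d}}{(1/\rho)^{d+1}-1}\Big(\frac{16}{\nu_1^2\rho^2}+9\Big),$$

it follows that for $n\ge 2$

$$\mathbb{E}[R_{n,2}]+\mathbb{E}[R_{n,3}]\le\gamma\,\rho^{-H(d+1)}\ln n.$$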
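It remains to define the parameter $H\ge z$. In particular, we propose to choose it such that the terms

$$4L\nu_1\rho^H n\qquad\text{and}\qquad\rho^{-H(d+1)}\ln n$$

are balanced. To this end, let $H$ be the smallest integer $k$ such that $4L\nu_1\rho^k n\le\gamma\,\rho^{-k(d+1)}\ln n$; in particular,

$$\rho^H\le\Big(\frac{\gamma\ln n}{4L\nu_1 n}\Big)^{1/(d+2)}$$

and

$$4L\nu_1\rho^{H-1}n>\gamma\,\rho^{-(H-1)(d+1)}\ln n,\qquad\text{implying}\qquad\gamma\,\rho^{-H(d+1)}\ln n\le 4L\nu_1\rho^H n\,\rho^{-(d+2)}.$$

Note from the inequality that this $H$ is such that

$$H\ge\frac{1}{d+2}\,\frac{\ln(4L\nu_1 n)-\ln(\gamma\ln n)}{\ln(1/\rho)},$$

and thus this $H$ satisfies $H\ge z$ in view of the assumption of the theorem indicating that $n$ is large enough. The final bound on the regret is then

$$\begin{aligned}
\mathbb{E}[R_n]&\le 4L\nu_1\rho^H n+\gamma\,\rho^{-H(d+1)}\ln n+(2^z-1)\Big(\frac{8\ln n}{\nu_1^2\rho^{2z}}+4\Big) \\
&\le\Big(1+\frac{1}{\rho^{d+2}}\Big)4L\nu_1\rho^H n+(2^z-1)\Big(\frac{8\ln n}{\nu_1^2\rho^{2z}}+4\Big) \\
&\le\Big(1+\frac{1}{\rho^{d+2}}\Big)4L\nu_1 n\Big(\frac{\gamma\ln n}{4L\nu_1 n}\Big)^{1/(d+2)}+(2^z-1)\Big(\frac{8\ln n}{\nu_1^2\rho^{2z}}+4\Big) \\
&=\Big(1+\frac{1}{\rho^{d+2}}\Big)(4L\nu_1 n)^{(d+1)/(d+2)}(\gamma\ln n)^{1/(d+2)}+(2^z-1)\Big(\frac{8\ln n}{\nu_1^2\rho^{2z}}+4\Big).
\end{aligned}$$

This concludes the proof.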



A.4 Proof of Theorem 10 (regret bound for local-HOO)

Proof We use the notation of the proof of Theorem 9. Let $r_0$ be a positive integer such that for $r\ge r_0$, one has

$$z_r\stackrel{\text{def}}{=}\lceil\log_2 r\rceil\ge h_0\qquad\text{and}\qquad z_r\le\frac{1}{d+2}\,\frac{\ln(4L\nu_1 2^r)-\ln(\gamma\ln 2^r)}{\ln(1/\rho)};$$

we can therefore apply the result of Theorem 9 in regimes indexed by $r\ge r_0$. For previous regimes, we simply upper bound the regret by the number of rounds, that is, $2^{r_0}-2\le 2^{r_0}$. For round $n$, we denote by $r_n$ the index of the regime where $n$ lies in (regime $r_n=\lfloor\log_2(n+1)\rfloor$). Since regime $r_n$ terminates at round $2^{r_n+1}-2$, we have

$$\begin{aligned}
\mathbb{E}[R_n]&\le\mathbb{E}\big[R_{2^{r_n+1}-2}\big] \\
&\le 2^{r_0}+\sum_{r=r_0}^{r_n}\bigg(\Big(1+\frac{1}{\rho^{d+2}}\Big)(4L\nu_1 2^r)^{(d+1)/(d+2)}(\gamma\ln 2^r)^{1/(d+2)}+(2^{z_r}-1)\Big(\frac{8\ln 2^r}{\nu_1^2\rho^{2z_r}}+4\Big)\bigg) \\
&\le 2^{r_0}+C_1(\ln n)\sum_{r=r_0}^{r_n}\Big(\big(2^{(d+1)/(d+2)}\big)^r+(2/\rho^2)^{z_r}\Big) \\
&\le 2^{r_0}+C_2(\ln n)\Big(\big(2^{(d+1)/(d+2)}\big)^{r_n}+r_n\,(2/\rho^2)^{z_{r_n}}\Big)=(\ln n)\,O\big(n^{(d+1)/(d+2)}\big),
\end{aligned}$$

where $C_1,C_2>0$ denote some constants depending only on the parameters but not on $n$. Note that for the last equality we used that the first term in the sum of the two terms that depend on $n$ dominates the second term.


A.5 Proof of Theorem 12 (uniform upper bound on the regret of HOO against the class of all weak Lipschitz environments)

The two equations derived earlier from Assumption A2 show that Assumption A2' is satisfied for $L=2$ and all $\varepsilon_0>0$. We take, for instance, $\varepsilon_0=3\nu_1$. Moreover, since $\mathcal{X}$ has a packing dimension of $D$, all environments have a near-optimality dimension less than $D$. In particular, for all $D'>D$ (as shown in the second step of the proof of Theorem 6 in Section A.1), there exists a constant $C$ (depending only on $\ell$, $\mathcal{X}$, $\varepsilon_0=3\nu_1$, $\nu_2$, and $D'$) such that Assumption A3 is satisfied. We can therefore take $h_0=0$ and apply Theorem 9 with $z=0$ and $M\in\mathcal{F}_{\mathcal{X},\ell}$; the fact that all the quantities involved in the bound depend only on $\mathcal{X}$, $\ell$, $\nu_2$, $D'$, and the parameters of HOO, but not on a particular environment in $\mathcal{F}_{\mathcal{X},\ell}$, concludes the proof.

A.6 Proof of Theorem 13 (minimax lower bound in metric spaces)

Let $K\ge 2$ be an integer to be defined later. We provide first an overview of the proof. Here, we exhibit a set $A$ of environments for the $\{1,\dots,K+1\}$-armed bandit problem and a subset $\mathcal{F}'\subset\mathcal{F}_{\mathcal{X},\ell}$ which satisfy the following properties.

(i) The set $A$ contains "difficult" environments for the $\{1,\dots,K+1\}$-armed bandit problem.

(ii) For any strategy $\varphi^{(\mathcal{X})}$ suited to the $\mathcal{X}$-armed bandit problem, one can construct a strategy $\psi^{(K+1)}$ for the $\{1,\dots,K+1\}$-armed bandit problem such that

$$\forall M\in\mathcal{F}',\ \ \exists\,\nu\in A,\qquad\mathbb{E}_M\big[R_n(\varphi^{(\mathcal{X})})\big]=\mathbb{E}_\nu\big[R_n(\psi^{(K+1)})\big].$$

We now provide the details.

Proof We only deal with the case of deterministic strategies. The extension to randomized strategies can be done using Fubini's theorem (by integrating also w.r.t. the auxiliary randomizations used).

First step. Let $\eta\in(0,1/2)$ be a real number and $K\ge 2$ be an integer, both to be defined during the course of the analysis. The set $A$ only contains $K$ elements, denoted by $\nu^1,\dots,\nu^K$ and given by product distributions. For $1\le j\le K$, the distribution $\nu^j$ is obtained as the product of the $\nu^j_i$ when $i\in\{1,\dots,K+1\}$ and where

$$\nu^j_i=\begin{cases}\operatorname{Ber}(1/2), & \text{if } i\ne j;\\ \operatorname{Ber}(1/2+\eta), & \text{if } i=j.\end{cases}$$

One can extract the following result from the proof of the lower bound of [10, Section 6.9].

Lemma 20 For all strategies $\psi^{(K+1)}$ for the $\{1,\dots,K+1\}$-armed bandit (where $K\ge 2$), one has

$$\max_{j=1,\dots,K}\mathbb{E}_{\nu^j}\big[R_n(\psi^{(K+1)})\big]\ge n\eta\Big(1-\frac{1}{K}-\eta\sqrt{4\ln(4/3)}\sqrt{\frac{n}{K}}\Big).$$
Second step. We now need to construct $\mathcal{F}'$ such that item (ii) is satisfied. We assume that $K$ is such that $\mathcal{X}$ contains $K$ disjoint balls with radius $\eta$. (We shall quantify later in this proof a suitable value of $K$.) Denoting by $x_1,\dots,x_K$ the corresponding centers, these disjoint balls are then $\mathcal{B}(x_1,\eta),\dots,\mathcal{B}(x_K,\eta)$.

With each of these balls we now associate a bandit environment over $\mathcal{X}$, in the following way. For all $x^*\in\mathcal{X}$, we introduce a mapping $g_{x^*,\eta}$ on $\mathcal{X}$ defined by

$$g_{x^*,\eta}(x)=\max\{0,\ \eta-\ell(x,x^*)\}$$

for all $x\in\mathcal{X}$. This mapping is used to define an environment $M_{x^*,\eta}$ over $\mathcal{X}$, as follows. For all $x\in\mathcal{X}$,

$$M_{x^*,\eta}(x)=\operatorname{Ber}\Big(\frac{1}{2}+g_{x^*,\eta}(x)\Big).$$

Let $f_{x^*,\eta}$ be the corresponding mean-payoff function; its values equal

$$f_{x^*,\eta}(x)=\frac{1}{2}+\max\{0,\ \eta-\ell(x,x^*)\}$$

for all $x\in\mathcal{X}$. Note that the mean payoff is maximized at $x=x^*$ (with value $1/2+\eta$) and is minimal for all points lying outside $\mathcal{B}(x^*,\eta)$, with value $1/2$. In addition, that $\ell$ is a metric entails that these mean-payoff functions are 1-Lipschitz and thus are also weakly Lipschitz. (This is the only point in the proof where we use that $\ell$ is a metric.) In conclusion, we consider

$$\mathcal{F}'=\big\{M_{x_1,\eta},\dots,M_{x_K,\eta}\big\}.$$
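As a small illustration of this construction, the sketch below evaluates the mean-payoff function $f_{x^*,\eta}$ and checks the 1-Lipschitz property on a finite grid. The choice of $\mathcal{X}=[0,1]$ with $\ell(x,y)=|x-y|$, the values of $x^*$ and $\eta$, and the function names are assumptions made for the example only.

```python
import itertools


def mean_payoff(x, x_star, eta, dist):
    """f_{x*,eta}(x) = 1/2 + max(0, eta - dist(x, x*))."""
    return 0.5 + max(0.0, eta - dist(x, x_star))


def is_one_lipschitz(points, x_star, eta, dist, tol=1e-12):
    """Check |f(x) - f(y)| <= dist(x, y) on all pairs of the given points
    (tol only absorbs floating-point rounding)."""
    return all(
        abs(mean_payoff(x, x_star, eta, dist) - mean_payoff(y, x_star, eta, dist))
        <= dist(x, y) + tol
        for x, y in itertools.combinations(points, 2)
    )


if __name__ == "__main__":
    dist = lambda a, b: abs(a - b)          # assumed metric on [0, 1]
    grid = [k / 100 for k in range(101)]
    print(mean_payoff(0.3, 0.3, 0.2, dist))  # value at the maximizer x*
    print(is_one_lipschitz(grid, 0.3, 0.2, dist))
```

The check relies on the triangle inequality of the metric, which is exactly where the proof uses that $\ell$ is a metric.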

Third step. We describe how to associate with each (deterministic) strategy $\varphi^{(\mathcal{X})}$ on $\mathcal{X}$ a (random) strategy $\psi^{(K+1)}$ on the finite set of arms $\{1,\dots,K+1\}$. Each of these strategies is indeed given by a sequence of mappings,

$$\varphi^{(\mathcal{X})}_1,\ \varphi^{(\mathcal{X})}_2,\ \dots\qquad\text{and}\qquad\psi^{(K+1)}_1,\ \psi^{(K+1)}_2,\ \dots$$

where for $t\ge 1$, the mappings $\varphi^{(\mathcal{X})}_t$ and $\psi^{(K+1)}_t$ should only depend on the past up to the beginning of round $t$. Since the strategy $\varphi^{(\mathcal{X})}$ is deterministic, the mapping $\varphi^{(\mathcal{X})}_t$ takes only into account the past rewards $Y_1,\dots,Y_{t-1}$ and is therefore a mapping $[0,1]^{t-1}\to\mathcal{X}$. (In particular, $\varphi^{(\mathcal{X})}_1$ equals a constant.)

We use the notations $I'_t$ and $Y'_t$ for, respectively, the arms pulled and the rewards obtained by the strategy $\psi^{(K+1)}$ at each round $t$. The arms $I'_t$ are drawn at random according to the distributions

$$\psi^{(K+1)}_t\big(I'_1,\dots,I'_{t-1},\ Y'_1,\dots,Y'_{t-1}\big),$$

which we now define. (Actually, they will depend on the obtained payoffs $Y'_1,\dots,Y'_{t-1}$ only.) To do that, we need yet another mapping $T$ that links elements in $\mathcal{X}$ to probability distributions over $\{1,\dots,K+1\}$. Denoting by $\delta_k$ the Dirac probability on $k\in\{1,\dots,K+1\}$, the mapping $T$ is defined as

$$T(x)=\begin{cases}\delta_{K+1}, & \text{if } x\notin\bigcup_{j=1,\dots,K}\mathcal{B}(x_j,\eta);\\[4pt] \Big(1-\dfrac{\ell(x,x_j)}{\eta}\Big)\delta_j+\dfrac{\ell(x,x_j)}{\eta}\,\delta_{K+1}, & \text{if } x\in\mathcal{B}(x_j,\eta) \text{ for some } j\in\{1,\dots,K\},\end{cases}$$

for all $x\in\mathcal{X}$. Note that this definition is legitimate because the balls $\mathcal{B}(x_j,\eta)$ are disjoint when $j$ varies between 1 and $K$.
Finally, $\psi^{(K+1)}$ is defined as follows. For all $t\ge 1$,

$$\psi^{(K+1)}_t\big(I'_1,\dots,I'_{t-1},\ Y'_1,\dots,Y'_{t-1}\big)=\psi^{(K+1)}_t\big(Y'_1,\dots,Y'_{t-1}\big)=T\Big(\varphi^{(\mathcal{X})}_t\big(Y'_1,\dots,Y'_{t-1}\big)\Big).$$

Before we proceed, we study the distribution of the reward $Y'$ obtained under $\nu^i$ (for $i\in\{1,\dots,K\}$) by the choice of a random arm $I'$ drawn according to $T(x)$, for some $x\in\mathcal{X}$. Since $Y'$ can only take the values 0 or 1, its distribution is a Bernoulli distribution whose parameter $\mu_i(x)$ we compute now. The computation is based on the fact that under $\nu^i$, the Bernoulli distribution corresponding to arm $j$ has $1/2$ as an expectation, except if $j=i$, in which case it is $1/2+\eta$. Thus, for all $x\in\mathcal{X}$,

$$\mu_i(x)=\begin{cases}\dfrac{1}{2}, & \text{if } x\notin\mathcal{B}(x_i,\eta);\\[6pt] \Big(1-\dfrac{\ell(x,x_i)}{\eta}\Big)\Big(\dfrac{1}{2}+\eta\Big)+\dfrac{\ell(x,x_i)}{\eta}\cdot\dfrac{1}{2}=\dfrac{1}{2}+\eta-\ell(x,x_i), & \text{if } x\in\mathcal{B}(x_i,\eta).\end{cases}$$

That is, $\mu_i=f_{x_i,\eta}$ on $\mathcal{X}$.
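The identity $\mu_i=f_{x_i,\eta}$ can be checked exactly on a toy instance. The sketch below is only an illustration: it takes $\mathcal{X}=[0,1]$ with $\ell(x,y)=|x-y|$, two hypothetical centers and a hypothetical radius, represents $T(x)$ as a dictionary from arms to probabilities, and uses rational arithmetic so that the comparison is exact.

```python
from fractions import Fraction


def T(x, centers, eta, dist):
    """Distribution over arms 1..K+1 (dict arm -> probability) attached to x.
    Assumes the balls B(center, eta) are pairwise disjoint."""
    K = len(centers)
    for j, center in enumerate(centers, start=1):
        d = dist(x, center)
        if d < eta:
            return {j: 1 - d / eta, K + 1: d / eta}
    return {K + 1: Fraction(1)}


def reward_mean(x, i, centers, eta, dist):
    """mu_i(x): mean reward under nu^i when the arm is drawn from T(x).
    Under nu^i, arm i has mean 1/2 + eta and every other arm has mean 1/2."""
    half = Fraction(1, 2)
    return sum(p * (half + eta if j == i else half)
               for j, p in T(x, centers, eta, dist).items())


def f(x, x_star, eta, dist):
    """Mean-payoff function f_{x*,eta}(x) = 1/2 + max(0, eta - dist(x, x*))."""
    return Fraction(1, 2) + max(0, eta - dist(x, x_star))


if __name__ == "__main__":
    dist = lambda a, b: abs(a - b)
    centers = [Fraction(1, 4), Fraction(3, 4)]   # hypothetical centers
    eta = Fraction(1, 8)                          # hypothetical radius
    grid = [Fraction(k, 64) for k in range(65)]
    print(all(reward_mean(x, i, centers, eta, dist) == f(x, centers[i - 1], eta, dist)
              for i in (1, 2) for x in grid))
```

Points inside a ball centered at some $x_j$ with $j\ne i$ mix arm $j$ and arm $K+1$, both of which have mean $1/2$ under $\nu^i$, so they also match $f_{x_i,\eta}$.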

Fourth step. We now prove that the distributions of the regrets of $\varphi^{(\mathcal{X})}$ under $M_{x_j,\eta}$ and of $\psi^{(K+1)}$ under $\nu^j$ are equal for all $j=1,\dots,K$. On the one hand, the expectations of rewards associated with the best arms equal $1/2+\eta$ under the two environments. On the other hand, one can prove by induction that the sequences $Y_1,Y_2,\dots$ and $Y'_1,Y'_2,\dots$ have the same distribution. (In the argument below, conditioning by empty sequences means no conditioning. This will be the case only for $t=1$.)

For all $t\ge 1$, we denote

$$X'_t=\varphi^{(\mathcal{X})}_t\big(Y'_1,\dots,Y'_{t-1}\big).$$

Under $\nu^j$ and given $Y'_1,\dots,Y'_{t-1}$, the distribution of $Y'_t$ is obtained by definition as the two-step random draw of $I'_t\sim T(X'_t)$ and then, conditionally on this first draw, $Y'_t\sim\nu^j_{I'_t}$. By the above results, the distribution of $Y'_t$ is thus a Bernoulli distribution with parameter $\mu_j(X'_t)$.

At the same time, under $M_{x_j,\eta}$ and given $Y_1,\dots,Y_{t-1}$, the choice of

$$X_t=\varphi^{(\mathcal{X})}_t\big(Y_1,\dots,Y_{t-1}\big)$$

yields a reward $Y_t$ distributed according to $M_{x_j,\eta}(X_t)$, that is, by definition and with the notations above, a Bernoulli distribution with parameter $f_{x_j,\eta}(X_t)=\mu_j(X_t)$.

The argument is concluded by induction and by using the fact that rewards are drawn independently in each round.

Fifth step. We summarize what we proved so far. For $\eta\in(0,1/2)$, provided that there exist $K\ge 2$ disjoint balls $\mathcal{B}(x_j,\eta)$ in $\mathcal{X}$, we could construct, for all strategies $\varphi^{(\mathcal{X})}$ for the $\mathcal{X}$-armed bandit problem, a strategy $\psi^{(K+1)}$ for the $\{1,\dots,K+1\}$-armed bandit problem such that, for all $j=1,\dots,K$ and all $n\ge 1$,

$$\mathbb{E}_{M_{x_j,\eta}}\big[R_n(\varphi^{(\mathcal{X})})\big]=\mathbb{E}_{\nu^j}\big[R_n(\psi^{(K+1)})\big].$$

But by the assumption on the packing dimension, there exists $c>0$ such that for all $\eta<1/2$, the choice of $K_\eta=\lceil c\,\eta^{-D}\rceil\ge 2$ guarantees the existence of such $K_\eta$ disjoint balls. Substituting this value, and using the results of the first and fourth steps of the proof, we get

$$\max_{j=1,\dots,K_\eta}\mathbb{E}_{M_{x_j,\eta}}\big[R_n(\varphi^{(\mathcal{X})})\big]=\max_{j=1,\dots,K_\eta}\mathbb{E}_{\nu^j}\big[R_n(\psi^{(K+1)})\big]\ge n\eta\Big(1-\frac{1}{K_\eta}-\eta\sqrt{4\ln(4/3)}\sqrt{\frac{n}{K_\eta}}\Big).$$

The proof is concluded by noting that

• the left-hand side is smaller than the maximal regret w.r.t. all weak Lipschitz environments;
• the right-hand side can be lower bounded and then optimized over $\eta<1/2$ in the following way.

By definition of $K_\eta$ and the fact that it is larger than 2, one has

$$n\eta\Big(1-\frac{1}{K_\eta}-\eta\sqrt{4\ln(4/3)}\sqrt{\frac{n}{K_\eta}}\Big)\ge n\eta\Big(1-\frac{1}{2}-\eta\sqrt{4\ln(4/3)}\sqrt{\frac{n}{c\,\eta^{-D}}}\Big)=n\eta\Big(\frac{1}{2}-C\,\eta^{1+D/2}\sqrt{n}\Big),$$

where $C=\sqrt{(4\ln(4/3))/c}$. We can optimize the final lower bound over $\eta\in[0,1/2]$. To that end, we choose, for instance, $\eta$ such that $C\,\eta^{1+D/2}\sqrt{n}=1/4$, that is,

$$\eta=\Big(\frac{1}{4C\sqrt{n}}\Big)^{1/(1+D/2)}=\Big(\frac{1}{4C}\Big)^{1/(1+D/2)}n^{-1/(D+2)}.$$

This gives the lower bound

$$\frac{1}{4}\Big(\frac{1}{4C}\Big)^{1/(1+D/2)}n^{1-1/(D+2)}=\underbrace{\frac{1}{4}\Big(\frac{1}{4C}\Big)^{1/(1+D/2)}}_{=\ \gamma(c,D)}\ n^{(D+1)/(D+2)}.$$

To ensure that this choice of $\eta$ is valid we need to show that $\eta\le 1/2$. Since the latter requirement is equivalent to

$$n\ge\Big(2\Big(\frac{1}{4C}\Big)^{1/(1+D/2)}\Big)^{D+2},$$

it suffices to choose the right-hand side to be $N(c,D)$; we then get that $\eta\le 1/2$ indeed holds for all $n\ge N(c,D)$, thus concluding the proof of the theorem. ■
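The final optimization over $\eta$ can be reproduced numerically. The sketch below is illustrative only (the function names and the sample values of $c$, $D$ and $n$ are assumptions): it computes $C$, $\gamma(c,D)$ and $N(c,D)$ from the formulas above and checks that the chosen $\eta$ satisfies $C\eta^{1+D/2}\sqrt{n}=1/4$ and yields the lower bound $\gamma(c,D)\,n^{(D+1)/(D+2)}$.

```python
import math


def lower_bound_constants(c: float, D: float):
    """Return (C, gamma, N) with C = sqrt(4 ln(4/3) / c),
    gamma = (1/4) (1/(4C))^(1/(1+D/2)) and N = (2 (1/(4C))^(1/(1+D/2)))^(D+2)."""
    C = math.sqrt(4.0 * math.log(4.0 / 3.0) / c)
    base = (1.0 / (4.0 * C)) ** (1.0 / (1.0 + D / 2.0))
    return C, 0.25 * base, (2.0 * base) ** (D + 2.0)


def optimal_eta(n: int, c: float, D: float) -> float:
    """eta solving C * eta^(1 + D/2) * sqrt(n) = 1/4."""
    C, _, _ = lower_bound_constants(c, D)
    return (1.0 / (4.0 * C * math.sqrt(n))) ** (1.0 / (1.0 + D / 2.0))


if __name__ == "__main__":
    c, D, n = 1.0, 1.0, 100
    C, gamma, N = lower_bound_constants(c, D)
    eta = optimal_eta(n, c, D)
    # With this eta the bound n*eta*(1/2 - 1/4) equals gamma * n^((D+1)/(D+2)).
    print(eta, n * eta * (0.5 - C * eta ** (1 + D / 2) * math.sqrt(n)),
          gamma * n ** ((D + 1) / (D + 2)), N)
```

For these sample values $N(c,D)$ is below 1, so the constraint $\eta\le 1/2$ holds for every horizon; for smaller $c$ the threshold becomes larger and restricts the admissible $n$.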